Talk

Is the great dataframe showdown finally over? Enter: polars

Sunday, May 28

14:30 - 15:00
Room: Pizza
Language: English
Audience level: Intermediate
Elevator pitch

Every dataframe library that came after pandas promised an expressive API and better performance, yet none has replaced the fluffy bamboo-eater 🐼 Or has one? Enter polars: will its multi-threaded, in-memory query engine be enough to dethrone the king? Come and learn to tame this new arctic beast!

Abstract

We all love pandas - at least as much as we are aware of its limitations. No, I am not talking about the “setting a view versus a copy” warning - I am talking about performance.

Eager execution plays nicely in notebooks, but is a burden in production. Moreover, pandas' single-threaded nature significantly limits how far it can scale. Improving on pandas seems easy - in fact, there are multiple successful libraries out there. Yet pandas is still ubiquitous. To be fair, it set an incredibly high bar, thanks in part to its expressive, high-level API.

No one ever made a mystery of pandas' limitations. Even its creator, Wes McKinney, said that “my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset”, while the docs clearly state that when performance is an issue “it’s worth considering not using pandas”.

Well, a new contestant is gaining traction, and it looks like it might end the great dataframe showdown: polars, an arctic beast with a blazingly fast multi-threaded query engine, written in Rust.

polars supports both lazy and eager evaluation, as well as larger-than-memory (streaming) data processing. It leverages Apache Arrow’s columnar format, and offers zero-copy conversion to and from pandas or numpy arrays. It also supports reading from Delta tables. Not only does polars pack quite a punch in terms of performance, it also offers an intuitive and elegant API, addressing some problems with pandas expressions using a familiar and pythonic syntax.

Let’s be clear: polars is here to stay. And maybe you’d better know how to tame it.

📍 Keynote outline

  1. pandas: when’s the time to look out for an alternative?
  2. polars: what’s the use case?
  3. polars and pandas: differences, similarities and benchmarking
  4. polars: lazy evaluation and streaming APIs
  5. polars: working with Delta tables
Tags: Big Data, Pandas, Performance, Multi-Threading

Luca Baggi

ML Engineer at Futura. When I am not preparing lectures in statistics and machine learning, I juggle between closing some of my 300+ open browser tabs and my true passion: collecting stars on GitHub 🔭🌟 In this treasure trove of more than 2,000 repositories, I am pretty sure I can find a tool to solve any problem, and I can’t wait to share them with you.