Every dataframe library that came after pandas promised an expressive API and better performance, yet none replaced the fluffy bamboo-eater 🐼 Or has one? Enter polars: will its multi-threaded, in-memory query engine be enough to dethrone the king? Come and learn to tame this new arctic beast!
We all love pandas - at least as much as we are aware of its limitations. No, I am not talking about the “setting a view versus a copy” warning - I am talking about performance.
Eager execution plays nicely in notebooks, but is a burden in production. Moreover, its single-threaded nature significantly limits how far it can scale. Improving on pandas seems easy - in fact, there are multiple successful libraries out there. Yet pandas is still ubiquitous. To be fair, it set an incredibly high bar, thanks in part to its expressive, high-level API.
No one ever made a mystery of pandas' limitations. Even its creator, Wes McKinney, said that “my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset”, while the docs clearly recommend that when performance is an issue “it’s worth considering not using pandas”.
Well, a new contestant is gaining traction, and it looks like it might end the great dataframe showdown: polars, an arctic beast with a blazingly fast multi-threaded query engine, written in Rust.
polars supports both lazy and eager evaluation, as well as larger-than-memory (streaming) data processing. It leverages Apache Arrow’s columnar format, and offers zero-copy conversion to and from pandas or numpy arrays. It also supports reading from Delta tables. Not only does polars pack quite a punch in terms of performance, it also offers an intuitive and elegant API, addressing some problems with pandas expressions using a familiar and pythonic syntax.
Let’s be clear: polars is here to stay. And maybe you’d better know how to tame it.
📍 Keynote outline
pandas: when’s the time to look out for an alternative?
polars: what’s the use case?
polars and pandas: differences, similarities and benchmarking
polars: lazy evaluation and streaming APIs
polars: working with Delta tables

ML Engineer at Futura, when I am not preparing lectures in statistics and machine learning, I juggle between closing some of my 300+ open tabs on the browser and my true passion: collecting stars on GitHub 🔭🌟 In this treasure trove of more than 2,000 repositories, I am pretty sure I can find any tool to solve a problem, and I can’t wait to share them with you.