Giles Weaver PyData Global 2023

Giles Weaver

Data scientist. Domain expertise in maritime shipping (AIS). User of PySpark & Dask for over five years. Formerly a bioinformatician. Available for contract work.

Sessions

12-07, 14:30, 30min

Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine

Ian Ozsvald, Giles Weaver

Pandas 2 brings new Arrow data types, faster calculations and better scalability. Dask scales Pandas across cores and recently released a new "expressions" optimization for faster computations. Polars is a new competitor to Pandas, designed around Arrow with native multicore support. Which should you choose for modern research workflows? We'll solve a data task that "just about fits in RAM" using all three tools, discussing the pros and cons of each so you can make the best choice for your research workflow. You'll leave with a clear idea of whether Pandas 2, Dask or Polars is the tool to invest in, and how Polars fits into the existing NumPy-focused ecosystem.
Do you still need 5x working RAM for Pandas operations (probably not!)? Can Pandas string operations actually be fast (sure)? Since Polars uses Arrow data structures, can we easily use tools like scikit-learn and Matplotlib (yes-maybe)? What limits do we still face? Could you switch to experimenting with Polars and, if so, what gains and issues might you face?

Data Track
