PyData Global 2023

Arrow revolution in pandas and Dask
12-06, 16:30–17:00 (UTC), Data Track

The pandas library for data manipulation and data analysis is the most widely used open source data science software library. Dask is the natural extension for scaling pandas workloads to more than a single machine. The continuing integration and adoption of Apache Arrow accelerates historical bottlenecks in both libraries.


Pandas reached an important milestone earlier this year with the 2.0 release, which brought DataFrames backed by Arrow instead of the historical NumPy backend. We will learn how the 2.0 release shapes pandas as a tool for Data Analysis and what future releases will bring, including a more efficient string implementation, general purpose Arrow-backed DataFrames, Copy-on-Write and better integration with the ecosystem.

Dask is a library for distributed computing with Python and integrates tightly with pandas and other libraries from the PyData stack. Dask underwent a number of changes over the last year, which are targeted at more efficient and faster computations. We'll discuss some of these changes in more detail, including:

  • A more efficient shuffling algorithm based on Arrow
  • Arrow-backed string implementations by default
  • A logical query optimization layer for Dask DataFrames

We'll look ahead and discuss what the future holds and these changes and ongoing efforts position Dask in the Big Data ecosystem among competitors like Spark


Prior Knowledge Expected

No previous knowledge expected

Matthew is an open source software developer in the numeric Python ecosystem. He maintains several PyData libraries, but today focuses mostly on Dask a library for scalable computing. Matthew worked for Anaconda Inc for several years, then built out the Dask team at NVIDIA for RAPIDS, and most recently founded Coiled to improve Python's scalability with Dask for large organizations.

Matthew holds a bachelors degree from UC Berkeley in physics and mathematics, and a PhD in computer science from the University of Chicago.

Patrick Hoefler is a member of the pandas core team and a Dask maintainer. He is currently working at Coiled where he focuses on Dask development and the integration of a logical query planning layer into Dask. He holds a Msc degree in Mathematics and works towards a Msc in Software engineering at the University of Oxford.