PyData Global 2023

High speed data from the Lakehouse to DataFrames with Apache Arrow
12-08, 13:00–13:30 (UTC), Data Track

In 2023, with the introduction of pandas 2.0, Apache Arrow became the dominant standard for both the in-memory representation and the over-the-wire transfer format for data in DataFrames.
In this talk, we will examine the performance benefits of using Apache Arrow end-to-end from the data lake or warehouse to client-side DataFrames. We will demonstrate, with Python examples, how data can now be moved between pandas 2.0, Polars, and DuckDB at no cost (zero-copy), and we will look at how Arrow enables the replacement of row-oriented APIs for data retrieval (JDBC/ODBC) with column-oriented protocols (Arrow Flight and ADBC). We will show how we built a query service that bridges the data lake with Python clients. DataFrame clients can read data using a network-hosted service that reads Arrow data from Parquet files, processes the data in Arrow format, and transfers the data to clients via an Arrow Flight service. We will also look at a file-free future for DataFrames, where they can be easily stored and updated in a serverless platform.


DataFrames are widely used for data transformations and analysis. DataFrames have historically been created from files in CSV, JSON, or Parquet file formats. As DataFrames have become more widely adopted for data processing pipelines, they have acquired new capabilities for both reading and writing from/to databases and data lakes.
In 2023, with the introduction of pandas 2.0, Apache Arrow became the dominant standard for both the in-memory representation and the over-the-wire transfer format for data in DataFrames. Now, data can be moved at little cost between DataFrame frameworks such as pandas 2.0 and Polars, and even distributed DataFrames such as PySpark. Network protocols such as Arrow Flight and ADBC enable data to flow from columnar datastores, like data warehouses, directly to pandas 2.0 clients without serialization/deserialization or transformations between row-oriented and column-oriented formats (as required by JDBC/ODBC APIs).

In this talk, we will explore the importance of Apache Arrow in the evolution of data for DataFrames, from CSV and Parquet file formats to table formats such as Apache Hudi, Iceberg, and Delta Lake. We will examine the huge performance benefits of using Apache Arrow end-to-end when integrating Python clients with data lakes and warehouses. We will also look ahead at work we have been doing in the open-source Hopsworks platform on network-hosted (serverless), incremental tables for DataFrames, backed by Arrow end-to-end.


Prior Knowledge Expected

No previous knowledge expected