High speed data from the Lakehouse to DataFrames with Apache Arrow
In 2023, with the introduction of Pandas2, Apache Arrow became the dominant standard for both in-memory representation and over-the-wire transfer format for data in DataFrames.
In this talk, we will examine the performance benefits of using Apache Arrow end-to-end from the data lake or warehouse to client-side DataFrames. We will demonstrate in Python examples how data can now be moved between Pandas2, Polars, and DuckDB at no cost (zero-copy) and we will look how Arrow enables the replacement of row-oriented APIs for data retrieval (JDBC/ODBC) with column-oriented protocols (Arrow Flight and ADBC). We will show how we built a query service that bridges the data lake with Python clients. DataFrame clients can read data using a network hosted service that reads Arrow data from Parquet files, processes the data in Arrow format, and transfers the data to clients using Arrow Flight service. We will also look to a file-free future for DataFrames, where they can be easily stored and updated in a serverless platform.