12-08, 14:30–15:00 (UTC), Data Track
Pandas is loved and venerated for its flexibility and ease of use. However, its oft-quoted slowness has prompted many others like DuckDB, Polars, and RAPIDS cuDF to step in and offer faster alternatives. These are all fantastic tools, but they have non-zero adoption costs and more restrictive APIs than pandas, and they don’t always work with third-party libraries that use pandas today.
cudf.pandas takes a fresh approach: instead of trying to be a replacement for pandas, it effectively accelerates pandas on the GPU. cudf.pandas requires no code changes (not even your pandas imports!), supports 100% of the pandas API, and third-party libraries that use pandas are magically accelerated on the GPU.
If you use pandas today and want to run your code on the GPU with 0 changes, this talk is for you. If you are the maintainer of a library that uses pandas and you’d like to support GPUs with 0 changes today, this talk is for you. If you’re a Pythonista at heart and enjoy hearing about the proxy pattern and deep import customization, this talk is for you!
First, I’ll begin by reflecting on pandas, alternatives like DuckDB, Polars, Modin, and cuDF, and the benefits and challenges of those alternatives.
Second, I’ll do a demo of cudf.pandas. I’ll show you how it magically turns pandas into an unbelievably fast DataFrame library without changing anything about your code. I’ll show you how to use the profiler to understand and improve performance. I’ll also demonstrate some really neat examples of accelerating third-party code without changing it.
Third, I’ll explain how the magic works: how we implement fallback and synchronization of data between cuDF (GPU) and pandas (CPU) behind the scenes, and how (and why!) we hijack the import of pandas. I’ll also explain how we test to ensure compatibility and walk through known limitations.
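To give a flavor of the two ideas in that third part, here is a minimal, standard-library-only sketch of a proxy with fallback and of an import hook. It is not how cudf.pandas is actually implemented. Every name in it (SlowBackend, FastBackend, FallbackProxy, AcceleratedFinder, and the module name toy_frames) is hypothetical and invented for illustration. The toy proxy falls back only when the fast backend lacks a method, which is an assumption made to keep the example short.

```python
import importlib
import importlib.abc
import importlib.util
import sys


class SlowBackend:
    """Stand-in for the CPU library: supports every operation."""

    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)

    def median(self):
        ordered = sorted(self.values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


class FastBackend:
    """Stand-in for the GPU library: supports only some operations."""

    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)

    def to_slow(self):
        # Models copying the data from the fast device back to the slow one.
        return SlowBackend(self.values)


class FallbackProxy:
    """Try the fast backend first; if it lacks the method, retry on the slow one."""

    def __init__(self, values):
        self._fast = FastBackend(values)
        self.log = []  # (method name, backend that served the call)

    def __getattr__(self, name):
        # Only reached for names that are not found on the proxy itself.
        def call(*args, **kwargs):
            method = getattr(self._fast, name, None)
            backend = "fast"
            if method is None:
                method = getattr(self._fast.to_slow(), name)
                backend = "slow"
            result = method(*args, **kwargs)
            self.log.append((name, backend))
            return result

        return call


class AcceleratedFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serve imports of the hypothetical module `toy_frames` from this hook."""

    def find_spec(self, fullname, path, target=None):
        if fullname == "toy_frames":
            return importlib.util.spec_from_loader(fullname, self)
        return None  # everything else goes through the normal import machinery

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        # Hand out the proxy class under the name user code expects.
        module.Frame = FallbackProxy


# Install the hook ahead of the default finders, then import as usual.
sys.meta_path.insert(0, AcceleratedFinder())
toy_frames = importlib.import_module("toy_frames")

frame = toy_frames.Frame([3, 1, 2])
frame.total()   # served by the fast backend
frame.median()  # fast backend lacks it, so the proxy falls back
print(frame.log)
```

In this sketch the calling code only ever sees toy_frames.Frame; the log shows that total was served by the fast backend and median by the slow one after a fallback. A production system would need to handle far more, such as attribute access, operators, and errors raised partway through an operation.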
Fourth: benchmarks! I’ll show you how cudf.pandas performs compared to other dataframe libraries on standard benchmarks.
Finally, I’ll touch on some related topics like pandas’ new Arrow-backed data types and the dataframe API standard, and how those relate to cudf.pandas.
No previous knowledge expected
Ashwin Srinath is a senior software engineer at NVIDIA, and part of the team developing RAPIDS. Prior to joining NVIDIA, he was a computational scientist at Clemson University, helping researchers develop and optimize HPC applications.