PyData Global 2023

Python as a Hackable Language for Interactive Data Science
12-08, 17:30–18:00 (UTC), General Track

Did you know that core Python syntax and semantics can be tailored for interactive computing use cases? More is possible than you might expect! For example, at the most basic level, Jupyter supports simple syntax extensions like so-called "magic" commands. One can go much deeper, however. In this talk, I'll show that it's possible to augment and abuse Python to support a plethora of interactive use cases. I'll start with the simple example of building an optional chainer for Python (supporting JavaScript-reminiscent syntax like a?.b()?.c). I'll then show how to use these same ideas to accelerate data science operations, concluding with an example of how to perform full dataflow tracking in order to give users the illusion of dataframe queries that run instantaneously.


The topic: Through the magic of lexer rewrites, parser rewrites, and tracing, the audience will explore some mind-bending aspects of the Python language and its ability to be molded to fit a variety of use cases in interactive computing. Well-known instrumentation tools for Python already exist, such as sorcery (https://github.com/alexmojaki/sorcery) and macropy (https://github.com/lihaoyi/macropy); this talk earns its place by making these topics accessible and by showing how they can be applied specifically to accelerate data science.

Audience: The audiences that would benefit most from this talk are a) data scientists, who may discover tools that make their lives easier and come away with ideas for new tools to ask for, and b) tool designers, who will gain new techniques for developing data tools.

Takeaways: My hope is that the audience will leave this talk with the temerity to hack Python for their own interactive use cases.

Background knowledge: I will assume that the audience is familiar with Python. Familiarity with some basic concepts from compiler design such as tokenization, parsing, and abstract syntax trees is helpful, but not required.

Rough time breakdown: In the first 10 minutes, I'll introduce the basic idea of token rewrites and AST-level instrumentation, illustrating with some simple examples from IPython / Jupyter, such as magics and top-level await. I'll talk about why AST instrumentation is hard to get right: different transformers seldom compose with each other, making such transformations bespoke and difficult to reason about. I'll then introduce the Pyccolo library, explain how it supports composable instrumentation, and show how it can be used to imbue Python with optional chaining powers (i.e., JavaScript-style a?.b()?.c syntax) in fewer than 100 lines of code.
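To give a flavor of what AST-level instrumentation looks like, here is a minimal sketch (not Pyccolo's actual implementation, which additionally needs a token-level rewrite so that the otherwise-invalid ?. syntax parses at all; the _maybe_getattr helper is hypothetical). It rewrites every attribute load x.y into a None-tolerant helper call, so a chain of accesses short-circuits instead of raising:

    import ast

    def _maybe_getattr(obj, name):
        # Hypothetical helper: return None instead of raising when obj is None.
        return None if obj is None else getattr(obj, name)

    class NoneSafeAttributes(ast.NodeTransformer):
        """Rewrite every attribute load `x.y` into `_maybe_getattr(x, 'y')`."""

        def visit_Attribute(self, node):
            self.generic_visit(node)  # rewrite nested accesses first
            if not isinstance(node.ctx, ast.Load):
                return node  # leave assignment/deletion targets alone
            call = ast.Call(
                func=ast.Name(id="_maybe_getattr", ctx=ast.Load()),
                args=[node.value, ast.Constant(node.attr)],
                keywords=[],
            )
            return ast.copy_location(call, node)

    tree = NoneSafeAttributes().visit(ast.parse("result = obj.attr.other"))
    ast.fix_missing_locations(tree)
    ns = {"_maybe_getattr": _maybe_getattr, "obj": None}
    exec(compile(tree, "<instrumented>", "exec"), ns)
    print(ns["result"])  # prints None instead of raising AttributeError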

In the next 10 minutes, I'll show how these same ideas can support other use cases tailored to data science (a minimal sketch of the lazy-evaluation idea follows the examples below).
Example 1: a dataframe query planner that optimizes its query before actually running it, while keeping the familiar pandas syntax.
Example 2: dataflow tracking to support reactive notebooks in https://github.com/ipyflow/ipyflow.
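As a taste of Example 1, here is a minimal sketch of the idea: a wrapper that records pandas-style operations instead of executing them eagerly. The LazyFrame class and its collect method are hypothetical names, and a real planner would reorder and fuse the recorded operations rather than simply replaying them:

    import pandas as pd

    class LazyFrame:
        """Records pandas-style operations instead of executing them eagerly."""

        def __init__(self, df, ops=()):
            self._df = df
            self._ops = list(ops)

        def query(self, expr):
            # Record a row filter; nothing runs yet.
            return LazyFrame(self._df, self._ops + [("filter", expr)])

        def __getitem__(self, cols):
            # Record a column projection; nothing runs yet.
            return LazyFrame(self._df, self._ops + [("project", cols)])

        def collect(self):
            # A real planner would optimize the recorded plan here (e.g.,
            # push filters toward the data source, prune unused columns);
            # this sketch just replays the log in order.
            df = self._df
            for kind, arg in self._ops:
                df = df.query(arg) if kind == "filter" else df[arg]
            return df

    raw = pd.DataFrame({"x": range(10), "y": range(10)})
    plan = LazyFrame(raw).query("x > 7")[["y"]]  # builds a plan, runs nothing
    print(plan.collect())  # rows where x > 7, column y only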

In the final 10 minutes, I'll conclude with an example that combines dataflow tracking with non-blocking assignment to give users the illusion of instantaneous dataframe operations, and I'll seed the audience with some ideas for future possibilities.
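To preview the non-blocking-assignment trick in isolation, here is a minimal sketch using a background thread and a proxy that blocks only on first use (ipyflow's actual machinery relies on tracing and dataflow analysis; the DeferredResult name is hypothetical):

    import time
    from concurrent.futures import ThreadPoolExecutor

    _pool = ThreadPoolExecutor()

    class DeferredResult:
        """Start a computation in the background; block only on first use."""

        def __init__(self, fn, *args):
            self._future = _pool.submit(fn, *args)

        def __getattr__(self, name):
            # Any real use of the value forces the computation to finish.
            return getattr(self._future.result(), name)

    def expensive_query():
        time.sleep(2)  # stand-in for a slow dataframe operation
        return [1, 2, 3]

    start = time.time()
    result = DeferredResult(expensive_query)
    print(f"assignment took {time.time() - start:.3f}s")  # ~0s: no blocking
    print(result.count(2), f"after {time.time() - start:.1f}s")  # blocks ~2s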


Prior Knowledge Expected

No previous knowledge expected

I'm an engineer at Databricks where I work on tools and infrastructure for machine learning and data science. I'm passionate about pushing the limits of Python for data science use cases, and would love to chat with other tool developers to learn about the exciting developments in this area. In my free time, besides maintaining a few open source projects, I enjoy spending time with my wife and our cat in our vegetable garden.