PyData Global 2023

Unified batch and stream processing in python
12-08, 16:00–16:30 (UTC), Data Track

Historically, it's been difficult to reuse existing batch processing code in streaming applications.
Because of this, ML engineers have had to maintain two implementations of their jobs:
one for streaming and one for batch.

In this talk we'll introduce beavers, a stream processing library optimized for analytics.
It can be used to run both batch and streaming jobs with minimal code duplication, whilst still being good at both.


Picture this.
You're a machine learning engineer.
You've been working on a new shiny model.
You have a batch job loading historical data and calculating sophisticated features with pandas.
These features are complicated.
They take a lot of merge_asof, join, pivot and ffill.
You've trained your model and the results look promising.
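
To make the scenario concrete, here is a minimal sketch of what such a batch feature job might look like. The data, the column names and the `batch_features` function are made up for illustration; only `ffill`, `sort_values`, `assign` and `pd.merge_asof` are real pandas calls.

```python
import pandas as pd

# Hypothetical historical data: trades and quotes for a single instrument.
trades = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2023-12-08 16:00:01", "2023-12-08 16:00:03", "2023-12-08 16:00:06"]
        ),
        "price": [100.0, 101.0, 99.5],
    }
)
quotes = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2023-12-08 16:00:00", "2023-12-08 16:00:02", "2023-12-08 16:00:05"]
        ),
        "mid": [99.9, None, 100.1],  # one quote is missing its mid price
    }
)


def batch_features(trades: pd.DataFrame, quotes: pd.DataFrame) -> pd.DataFrame:
    """Illustrative feature: trade price minus the latest known quote mid."""
    # Fill gaps in the quotes with the last known value.
    quotes = quotes.sort_values("time").assign(mid=lambda df: df["mid"].ffill())
    # For each trade, pick the most recent quote at or before the trade time.
    merged = pd.merge_asof(trades.sort_values("time"), quotes, on="time")
    return merged.assign(price_minus_mid=merged["price"] - merged["mid"])


features = batch_features(trades, quotes)
print(features)
```

The whole history is processed in one vectorised pass, which is what makes this style fast in batch.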

Now it's time to deploy this model to production for real time inference.
That's when you realise that your pandas code does not fit well in your streaming framework.
Preparing features becomes very slow when processing one event at a time.
So instead, you have to reimplement your feature preparation in plain old Python.
The new version is optimized for processing one event at a time.
But then you have two versions of the same code that are very hard to maintain.
Not an ideal solution...
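
As a rough sketch of what such a hand-written streaming version can look like, consider a hypothetical feature: the trade price minus the latest known quote mid (what `merge_asof` plus `ffill` would compute in pandas). The class, method and event names below are assumptions for illustration and do not come from any particular streaming framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PriceMinusMid:
    """Keeps just enough state to compute the feature one event at a time."""

    last_mid: Optional[float] = None

    def on_quote(self, mid: Optional[float]) -> None:
        # Equivalent of a forward fill: ignore missing values, keep the last one.
        if mid is not None:
            self.last_mid = mid

    def on_trade(self, price: float) -> Optional[float]:
        # Equivalent of an as-of join: use the latest quote seen so far.
        if self.last_mid is None:
            return None
        return price - self.last_mid


# Hypothetical interleaved event stream, in time order.
events = [
    ("quote", 99.9),
    ("trade", 100.0),
    ("quote", None),
    ("trade", 101.0),
    ("quote", 100.1),
    ("trade", 99.5),
]

state = PriceMinusMid()
results = []
for kind, value in events:
    if kind == "quote":
        state.on_quote(value)
    else:
        results.append(state.on_trade(value))
print(results)
```

The logic is the same as the batch version, but it is expressed as explicit state and callbacks, so any change to the feature has to be made, and tested, twice.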

It would be much nicer to only have one implementation.
Or at least be able to reuse the batch code in the streaming implementation.

In this talk we'll see how to make this possible.

Here's a rough outline of the talk:
- minutes 00 to 05 introduce an example of feature calculation in batch
- minutes 05 to 10 show how difficult it is to convert the batch code to a streaming app
- minutes 10 to 15 introduce beavers (https://beavers.readthedocs.io/en/latest/), an open source library for stream analytics
- minutes 15 to 20 see how to port our batch application to beavers and discuss the different abstractions
- minutes 20 to 30 go in depth into different streaming concepts and abstractions


Prior Knowledge Expected

Previous knowledge expected

After graduating with an engineering degree in 2009, I’ve worked in all four corners of the City of London, for various financial institutions, big and small.
As a software engineer, I specialise in data intensive applications.
I've worked with both real-time systems and batch jobs.
I have a keen interest in how we can get the two to interact seamlessly.