PyData Global 2023

Event-Driven Data Science: Reconceptualizing Machine Learning for the Real-time World
12-07, 14:30–15:00 (UTC), Machine Learning Track

Did you know that 87% of data science projects never make it into production? While open source libraries like scikit-learn and TensorFlow have gone a long way toward democratizing data science, they are also unintentionally limited by the assumptions and research focus of academia at the time they were released. One such assumption is that a model must be trained on batches of data and that all machine learning models need more data in order to perform well. This introduces a gap between training and inference, because enough instances must accumulate before training can begin. For real-time use cases such as anomaly detection, models can become stale even before they are deployed to production.

Fortunately, there has been a trend toward building machine learning models that learn from streams of data and can react immediately to changes in that data. This form of learning is usually referred to as real-time machine learning, online learning, or incremental learning.

In this talk, we will compare the two approaches to machine learning, provide a brief overview of River, a library for building online learning models, and demo a real-time application using PyEnsign, a real-time data streaming client.


The common paradigm of batch modeling introduces a disconnect between development and production. Data scientists often discover that what worked in development does not work in production. In reality, data often comes in streams instead of large batches and is constantly evolving. As a result, models tend to lose accuracy and inferential power after being deployed into production.

Real-time machine learning, also known as online learning, is an alternative approach that helps models adapt quickly to changing data. These models learn from and make inferences on each data point as soon as it arrives. They may struggle initially but are able to adapt much more quickly than traditional batch models. Moreover, data scientists are able to build applications that line up more closely with production scenarios.
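As a rough illustration of the predict-then-learn loop described above, here is a minimal sketch in plain Python. The class `OnlineLinearRegression`, its `predict_one`/`learn_one` methods, and the simulated `stream` are hypothetical names chosen for this example; this is not River's implementation, only a one-feature linear model updated with one stochastic gradient step per instance. The simulated relationship between `x` and `y` changes halfway through the stream, so the sketch can show the model struggling at first, settling, and then recovering after the change.

```python
import random


class OnlineLinearRegression:
    """Hypothetical one-feature linear model updated one instance at a time."""

    def __init__(self, learning_rate=0.05):
        self.learning_rate = learning_rate
        self.weight = 0.0
        self.bias = 0.0

    def predict_one(self, x):
        return self.weight * x + self.bias

    def learn_one(self, x, y):
        # One gradient step on the squared error for this single instance.
        error = self.predict_one(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def stream(n, seed=0):
    """Simulated stream (assumption): y = 2x for the first half, then y = -x + 3."""
    rng = random.Random(seed)
    for i in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x if i < n // 2 else -x + 3.0
        yield x, y


def mean(values):
    return sum(values) / len(values)


model = OnlineLinearRegression()
abs_errors = []
for x, y in stream(4000):
    # Predict first, so every error is measured on an instance not yet seen.
    abs_errors.append(abs(model.predict_one(x) - y))
    # Then learn from that same instance immediately.
    model.learn_one(x, y)

early_error = mean(abs_errors[:100])             # model is still warming up
settled_error = mean(abs_errors[1900:2000])      # just before the change
after_change_error = mean(abs_errors[2000:2100]) # right after the change
recovered_error = mean(abs_errors[3900:4000])    # after adapting again

print(early_error, settled_error, after_change_error, recovered_error)
```

No retraining job is involved at any point: the same loop that serves predictions also updates the model, which is what lets it recover after the relationship in the data changes.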

Data streaming architectures make it possible to build real-time machine learning applications that are more resilient, less error-prone, and less susceptible to drift. Imagine how much more robust your applications would be if they were not only trained on the freshest data, but could also alert you to drift as soon as it happens -- you'd be able to react immediately, as opposed to a batchwise process where you'd be lucky to catch the issue within a day!
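To make the idea of alerting on drift as events arrive more concrete, here is a minimal sketch in plain Python. `MeanShiftMonitor` is a hypothetical name, and the check it performs (comparing the mean of a recent sliding window against a frozen baseline window) is a deliberately simple stand-in rather than a published drift detection algorithm or anything taken from River or PyEnsign. The window size, threshold, and simulated signal are all assumptions chosen for illustration.

```python
from collections import deque
import random


class MeanShiftMonitor:
    """Hypothetical drift check: recent window mean vs. a frozen baseline mean."""

    def __init__(self, window_size=50, threshold=1.0):
        self.window_size = window_size
        self.threshold = threshold
        self.reference = []                      # baseline, filled once
        self.recent = deque(maxlen=window_size)  # sliding window of new values

    def update(self, value):
        """Consume one value; return True if the recent mean has shifted."""
        if len(self.reference) < self.window_size:
            self.reference.append(value)  # still collecting the baseline
            return False
        self.recent.append(value)
        if len(self.recent) < self.window_size:
            return False  # not enough recent values to compare yet
        reference_mean = sum(self.reference) / len(self.reference)
        recent_mean = sum(self.recent) / len(self.recent)
        return abs(recent_mean - reference_mean) > self.threshold


rng = random.Random(42)
monitor = MeanShiftMonitor()
first_alert = None
for i in range(1000):
    # Simulated signal (assumption): mean 0 for 500 events, then mean 3.
    value = rng.gauss(0.0 if i < 500 else 3.0, 0.5)
    if monitor.update(value) and first_alert is None:
        first_alert = i

print("first alert at event", first_alert)
```

Because the check runs on every event, the alert fires within a few dozen events of the shift instead of waiting for the next scheduled batch evaluation.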

This talk is for data scientists and engineers who are interested in creating real-time machine learning applications without leaving the comfort of Python. We will discuss the key weaknesses of the batchwise modeling approach and demonstrate how real-time models help address them. We will then build a simple real-time machine learning application using River and PyEnsign!


Prior Knowledge Expected

No previous knowledge expected

Prema Roman is a distributed systems engineer at Rotational Labs. She is an experienced software, data, and machine learning engineer with a proven track record of building high quality software applications and data products. Her passion for continuous learning has taken her a long way from her start as a data analyst, as she takes on new challenges at Rotational Labs building globally distributed systems and machine learning data products.

Patrick Deziel is a distributed systems engineer and machine learning specialist. Patrick has extensive experience building and maintaining mission-critical systems in the private sector, as well as integrating modern ML solutions into existing applications. At Rotational, he designs and builds intelligent distributed systems to enable global use cases. In his free time, Patrick enjoys rock climbing and consuming science fiction.