PyData Global 2023

Bridging Classic ML Pipelines with the World of LLMs
12-07, 20:00–20:30 (UTC), Machine Learning Track

You probably don’t need a fancy new tool to take advantage of LLMs. While the explosion of inventive AI applications feels like a massive leap forward, the core challenges in plugging them into the business represent an incremental step from the discipline of MLOps.

The challenges are largely equivalent. Retrieval augmented generation is effectively a recommendation system. Agents are the control flow of your program. Chains of LLM calls are simple DAGs. And you’re still stuck trying to monitor predictions whose quality is hard to quantify, wrestle expensive, unstable APIs into submission, and build out and manage complex dataflows.

The toolbox remains similar as well. In this talk we present Hamilton, an open source microframework for expressing dataflows in Python. We show how it can help you build observable, stable, context-independent pipelines that run the gamut from classical ML to LLMs/RAG, enabling you to maintain sanity and keep up with the pace of change as everyone steps into the fascinating new world of AI.
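To make the "chains of LLM calls are simple DAGs" claim concrete, here is a minimal sketch in plain Python. It is not Hamilton's API: it only illustrates the general idea of declaring a dataflow as functions whose parameter names identify their upstream dependencies. Every name in it (`run_dag`, `cleaned_query`, `retrieved_docs`, `prompt`, `answer`) is hypothetical, and the LLM call is a stub rather than a real client.

```python
import inspect
from typing import Any, Callable, Dict, List


# Each function is one node; its parameter names are the nodes it depends on.
def cleaned_query(raw_query: str) -> str:
    """Classic preprocessing step."""
    return raw_query.strip().lower()


def retrieved_docs(cleaned_query: str, corpus: List[str]) -> List[str]:
    """Toy retrieval step: keep documents sharing a word with the query."""
    words = set(cleaned_query.split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def prompt(cleaned_query: str, retrieved_docs: List[str]) -> str:
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n".join(retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {cleaned_query}"


def answer(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; no real API is used."""
    return f"[stubbed LLM response to a {len(prompt)}-character prompt]"


def run_dag(
    nodes: List[Callable[..., Any]],
    outputs: List[str],
    inputs: Dict[str, Any],
) -> Dict[str, Any]:
    """Resolve the requested outputs by walking dependencies depth-first."""
    registry = {fn.__name__: fn for fn in nodes}
    cache: Dict[str, Any] = dict(inputs)  # computed values, seeded with inputs
    in_progress: set = set()  # nodes currently being resolved (cycle check)

    def resolve(name: str) -> Any:
        if name in cache:
            return cache[name]
        if name not in registry:
            raise KeyError(f"No node or input named {name!r}")
        if name in in_progress:
            raise ValueError(f"Cycle detected at {name!r}")
        in_progress.add(name)
        fn = registry[name]
        kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
        in_progress.discard(name)
        cache[name] = fn(**kwargs)
        return cache[name]

    return {name: resolve(name) for name in outputs}


result = run_dag(
    nodes=[cleaned_query, retrieved_docs, prompt, answer],
    outputs=["retrieved_docs", "answer"],
    inputs={
        "raw_query": "  What is Hamilton? ",
        "corpus": ["Hamilton is a dataflow library", "Unrelated text"],
    },
)
print(result["answer"])
```

Because each step is an ordinary function, the retrieval and prompt-building nodes can be unit tested exactly like feature-engineering code, and the stubbed `answer` node can be swapped for a real model call without touching the rest of the graph.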


In this talk, we will:

  1. Argue that you don’t need over-engineered abstractions to adopt LLM tooling into your MLOps pipelines.
  2. Show how the open-source framework Hamilton can help bridge the gap, allowing you to build and run production-ready LLM & ML pipelines in a variety of different contexts.

More specifically, we will:

  1. Provide an overview/comparison of the challenges with ML and LLM pipelines
  2. Convince you that directed acyclic graphs (DAGs) are core to modeling both worlds
  3. Talk about how the open source framework Hamilton allows you to describe both ML & LLM pipelines
  4. Dig into examples, showing how you can use Hamilton to build self-documenting, highly customizable pipelines, providing a unifying approach to building (and even combining!) ML & LLM pipelines.

Prior Knowledge Expected

No previous knowledge expected

Elijah built large components of the simulation/trading infrastructure at Two Sigma, and led a team to test/ensure the reliability of their quantitative code. He then built out the ML platform at Stitch Fix that was used by 100+ data scientists (see https://multithreaded.stitchfix.com/blog/2022/07/14/deployment-for-free/). Most recently he co-authored the open source library Hamilton, a general-purpose lightweight framework for building dataflows in Python. Due to the success/possibilities presented by Hamilton, he left his job at Stitch Fix and started DAGWorks, with the goal of making it easy for Data Scientists to build and manage machine learning ETLs.

A hands-on leader and Silicon Valley veteran, Stefan has spent over 15 years thinking about data and machine learning systems, building product applications and infrastructure at places like Stanford, Honda Research, LinkedIn, Nextdoor, Idibon, and Stitch Fix. A regular conference speaker, Stefan has guest lectured at Stanford’s Machine Learning Systems Design course and is an author of a popular open source framework called Hamilton. Stefan is currently CEO of DAGWorks, an open source startup that gives teams a standardized way to build and maintain data, ML and LLM pipelines without the coding nightmares.