PyData Global 2023

11:00
11:00
60min
Machine Learning Track
11:00
30min
Visualization Track
11:00
30min
Let chatGPT decide and run the function!
sonam

We all know ChatGPT is smart, but is it smart enough to choose a function from a query? We will explore this through OpenAI function calling, letting the model pick the right function for a given query. We will also walk through the extracted JSON output with a demo.
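
For instance, with the 2023-era OpenAI Python SDK (pre-1.0), a sketch might look like this; the get_weather schema is a hypothetical example of our own, not part of the OpenAI API:

import json
import openai  # assumes the 2023-era openai<1.0 SDK

# A hypothetical function schema we want the model to choose from.
functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Lagos?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call a function
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model returns the chosen function name plus JSON-encoded arguments.
    args = json.loads(message["function_call"]["arguments"])
    print(message["function_call"]["name"], args)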

Data Track
Data Track
11:00
30min
Unlock the Full Potential of Jupyter Notebooks
Nir Barazida

If you're using Jupyter Notebooks in your workflow, this session is for you. Learn about practical tips, workflows, and MLOps tools that help teams and individuals scale their work, better utilize Jupyter Notebooks, and successfully bring projects from research to production.

General Track
General Track
11:30
11:30
90min
Btune: Making Compression Better
Francesc Alted

Data compression is not a one-codec-fits-all problem. It necessarily involves a trade-off between compression ratio and speed. A higher compression ratio usually results in a slower compression process. Depending on the needs, one may want to prioritize one over the other. The issue is that finding the optimal compression parameters can be a slow process due to the large number of combinations of compression parameters (codec, compression level, filter, split mode, number of threads, etc.), and it may require a significant amount of manual trial and error to find the best combinations.

Btune (https://btune.blosc.org) is a dynamic plugin for Blosc2 that can help find the optimal combination of compression parameters for datasets compressed with Blosc2 (https://github.com/Blosc/c-blosc2, https://github.com/Blosc/python-blosc2), while significantly speeding up this process.
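
As a rough illustration of the size of that search space, here is a minimal sweep over just two codecs and three compression levels with python-blosc2; the timing loop is our own illustration of what Btune automates, not the Btune API itself:

import time
import numpy as np
import blosc2

data = np.linspace(0, 100, 10_000_000).tobytes()

# A tiny corner of the search space that Btune explores automatically.
for codec in (blosc2.Codec.LZ4, blosc2.Codec.ZSTD):
    for clevel in (1, 5, 9):
        t0 = time.perf_counter()
        frame = blosc2.compress(data, typesize=8, clevel=clevel, codec=codec)
        elapsed = time.perf_counter() - t0
        print(f"{codec.name:6s} clevel={clevel}: "
              f"ratio={len(data) / len(frame):5.1f}x, time={elapsed:.3f}s")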

Data Track
Data Track
11:30
30min
Build a Data Visualization App For Your Phone
Russell Keith-Magee

The modern mobile phone is an incredibly powerful computing device. However, mobile platforms have historically excluded the Python data science community, requiring specialist platform-specific skills, or making the use of Python data science tools exceedingly difficult.

This isn't true any more. In this talk, you'll learn how to build and run an app on your phone that uses the Python data analysis and visualization tools you're already familiar with.

Visualization Track
Visualization Track
11:30
90min
When Design Thinking Meets Open Source
Ramona Sartipi

When it comes to open source contributions, design is often an afterthought. There is a plethora of innovative open source software that is made with little or no contribution from experienced designers, which often leads to inconsistent interfaces, confusing interactions, and ultimately, a poor user experience. When paired together, strong open source projects and human-centered, empathetic design thinking can create software that users can actually use. This session will explore the opportunities for design in open source projects and how developers can exercise a few design practices to influence the adoption and usability of their projects. It will examine what experience design is, why it matters, and the principles behind effective design. Attendees will learn through hands-on activities how to incorporate design thinking strategies into their projects without sacrificing design and how doing so will result in a better product for their users.

General Track
General Track
12:00
12:00
30min
Building Interactive, Animated Reports and Dashboards in Streamlit with ipyvizzu
Peter Vidos

As data scientists, you understand the power of data, but the true value lies in enabling others to explore and comprehend insights. Join us on a journey where data becomes more than just numbers – a dynamic, interactive story that anyone can engage with.

Streamlit, known for its user-friendly approach to data app development, is now enhanced with the integration of ipyvizzu. This innovative open-source data visualization tool places a strong emphasis on animation and storytelling. This combination empowers data scientists to craft and deploy immersive, animated reports and dashboards swiftly.

Imagine the impact of creating Streamlit apps that allow business stakeholders without data expertise to independently analyze complex datasets, generate custom animated charts, and construct interactive data narratives. It's a game-changer for data-driven decision-making.
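
As a taste of the ipyvizzu side, a minimal animated chart might look like the sketch below (notebook usage; the toy DataFrame is hypothetical, and embedding inside Streamlit goes through the integration the talk covers):

import pandas as pd
from ipyvizzu import Chart, Config, Data

df = pd.DataFrame({"genre": ["Pop", "Rock", "Jazz"], "plays": [120, 90, 45]})

data = Data()
data.add_data_frame(df)

chart = Chart()
chart.animate(data)
chart.animate(Config({"x": "genre", "y": "plays", "title": "Plays by genre"}))
chart.animate(Config({"coordSystem": "polar"}))  # animates into a polar view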

Visualization Track
Visualization Track
12:00
30min
Version Control and Beyond: Leveraging Git for ML Experiment Management
Eryk Lewinson

Before finalizing a machine learning model, data scientists conduct dozens, if not hundreds, of experiments. To keep track of these experiments, they employ setups of varying complexity, including physical notebooks, spreadsheets, or even complex configurations using various libraries and dedicated infrastructure. In this practical presentation, I will demonstrate how you and your team can start tracking experiments right away using a very simple setup, with most of the ingredients you are probably already using.

Machine Learning Track
Machine Learning Track
12:30
12:30
30min
Paradoxes in model training and evaluation under constraints
Malte Tichy

In many domains, machine learning methods predict the future demand of some physical good or virtual service that comes with finite capacity. Those predictions are then typically used to plan an appropriate level of supply. Often, it is not possible to directly measure (and train on) the actual demand, but only the fraction of it that could be fulfilled under the constraints in place at the time, such as finite stocks or limited capacity. That is, one predicts a different quantity than one measures. This talk explores the various surprising aspects of the demand-sales distinction that can arise in data science projects. We explore the paradoxes and the most dramatic problems that one encounters and find out how to avoid them. This talk will sharpen your thoughts when dealing with such intricate settings, and allow you to create and utilize demand forecasts in the best possible way.
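
A minimal sketch of the core issue, using toy numbers: observed sales are demand censored by capacity, so training on sales systematically understates demand:

import numpy as np

rng = np.random.default_rng(0)
demand = rng.poisson(lam=10, size=100_000)  # true demand: never directly observed
stock = 12                                  # the capacity constraint
sales = np.minimum(demand, stock)           # what actually gets measured

print(demand.mean())  # ~10.0
print(sales.mean())   # systematically lower: training on sales biases the model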

Machine Learning Track
Machine Learning Track
12:30
30min
Solara simplifies building complex dashboards.
Maarten

Many Python frameworks are suitable for creating basic dashboards, but struggle with more complex ones. Though many teams default to splitting into separate frontend and backend divisions when faced with increasing dashboard complexity, this approach introduces its own set of challenges, like reduced personnel interchangeability and cumbersome refactoring due to REST API changes.

Solara, our new web framework, addresses these challenges. We use the foundational principles of ReactJS, yet maintain the ease of writing only Python. Solara has a declarative API, designed for dynamic and complex UIs, yet easy to write. Reactive variables power our state management and automatically trigger rerenders. Our component-centric architecture encourages code reusability, and hot reloading promotes efficient workflows. Together with our rich set of UI and data-focused components, Solara spans the entire spectrum from rapid prototyping to robust, complex dashboards.

Without modification your application and components will work in Jupyter, Voilà and on our standalone server for high scalability. Our server can run along existing FastAPI, Starlette, Flask and even Django servers to integrate with existing web services. We prioritize code quality and developer friendliness by including strong typing and first class support for unit and integration testing.
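
A minimal sketch of the reactive style (assuming current Solara; run with: solara run app.py):

import solara

clicks = solara.reactive(0)  # reactive state: updating it triggers a rerender

@solara.component
def Page():
    solara.Button(
        label=f"Clicked {clicks.value} times",
        on_click=lambda: clicks.set(clicks.value + 1),
    )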

Visualization Track
Visualization Track
13:00
13:00
120min
Visualization Track
13:00
30min
More like this: monitoring recommender systems in production
Emeli Dral

How can you make sure that your recommender systems work as expected? Once you put them into production and users start interacting with the model predictions, evaluating the model output quality might become tricky. In this talk, we will explore how to monitor the quality of recommender systems in production, detect data drift, and prevent known model failure modes.

General Track
General Track
13:00
240min
Panel Sprint
Philipp Rudiger, Simon Hansen, Andrew Huang

Join the sprint at https://numfocus-org.zoom.us/j/81665667614?pwd=1x2JKibaHybUAztHkQ64bOSo5fCrCX.1

Philipp Rudiger (https://github.com/philippjfr)
Simon Hansen (https://github.com/Hoxbro)
Andrew Huang (https://github.com/ahuang11)

Sprint
Sprints
13:00
30min
Polars and time zones: everything you need to know
Marco Gorelli

"You should never, ever deal with time zones if you can help it" Tom Scott

Instead, you should let your software deal with time zones for you.

Polars is a Dataframe library with full support for time zones - come and learn how to leverage it to its full potential!
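
A small sketch of the core operations (current Polars API):

from datetime import datetime

import polars as pl

df = pl.DataFrame({"ts": [datetime(2023, 12, 7, 12, 0), datetime(2023, 12, 7, 18, 30)]})

df = df.with_columns(
    pl.col("ts")
    .dt.replace_time_zone("UTC")               # declare the naive stamps as UTC
    .dt.convert_time_zone("Europe/Amsterdam")  # then view them in another zone
    .alias("ts_ams")
)
print(df)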

Data Track
Data Track
13:00
90min
sktime - python toolbox for time series: new features 2023 – advanced pipelines, probabilistic forecasting, parallelism support, composable classifiers and distances, reproducibility features
sktime community, Benedikt Heidrich, Anirban Ray, Franz Kiraly

sktime is a widely used scikit-learn compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the pydata/numfocus stack.

This tutorial gives an updated introduction to sktime and presents a vignette slideshow with the most important features added since PyData Global 2022.
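
A minimal sktime forecast, for flavor:

from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()                                    # monthly airline passengers
forecaster = NaiveForecaster(strategy="last", sp=12)  # seasonal naive baseline
forecaster.fit(y)
print(forecaster.predict(fh=[1, 2, 3]))               # forecast the next three months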

Machine Learning Track
Machine Learning Track
13:30
13:30
30min
Blosc2: Fast And Flexible Handling Of N-Dimensional and Sparse Datasets
Francesc Alted

N-dimensional datasets are pervasive in many scientific areas, and getting quick slices of them is critical for an improved exploration experience. Blosc2 is a compression and format library that recently gained support for dealing with such multidimensional datasets. Crucially, by leveraging compression, Blosc2 can deal with sparse datasets effectively: the zeroed parts are almost entirely suppressed, while the non-zero parts are still stored in smaller sizes than their uncompressed counterparts. In addition, the new double data partition inside Blosc2 minimizes the decompression of unnecessary data and provides top-class slicing speed.
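
A rough sketch with python-blosc2's NDArray interface (the toy array is ours; exact attribute names may vary by version):

import numpy as np
import blosc2

# A mostly-zero 3-D array: compression all but removes the zeroed chunks.
arr = np.zeros((200, 200, 200), dtype=np.float64)
arr[50:60, 50:60, 50:60] = 1.0

nd = blosc2.asarray(arr)                  # compressed, chunked NDArray
print(nd.schunk.cratio)                   # compression ratio
print(nd[50:55, 50:55, 50:55].sum())      # decompresses only the partitions it needs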

Data Track
Data Track
13:30
30min
Map of Open-Source Science (MOSS)
Tim Bonnemann, Jonathan Starr

The Map of Open-Source Science (MOSS) is a researcher’s entry point into the world of OSS for science that helps to improve discoverability of existing open-source scientific software tools and the communities around them.

General Track
General Track
14:00
14:00
30min
Fighting Money Laundering with Python and Open Source Software
Gajendra Deshpande

In this talk, we will discuss how to detect chains of fraudulent transactions and help investigation agencies fight money laundering by providing useful insights, with the help of the Python programming language and its packages.
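
One way such chains can be surfaced, sketched here with networkx on hypothetical transfers (the talk's actual toolset may differ):

import networkx as nx

# Hypothetical account-to-account transfers.
transfers = [("A", "B"), ("B", "C"), ("C", "A"),  # a round-trip: classic layering
             ("C", "D"), ("D", "E")]

G = nx.DiGraph(transfers)
for cycle in nx.simple_cycles(G):
    print("suspicious chain:", " -> ".join(cycle))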

General Track
General Track
14:00
30min
How to build a data pipeline without data: Synthetic data generation and testing with Python
Ruan Pretorius

Data pipelines are essential for transforming, validating, and loading data from various sources into a target database or data warehouse. However, building and testing data pipelines can be challenging when the real data is not available, either due to privacy issues, technical limitations, or simply because the data is not yet collected. How can we ensure that our data pipelines are robust and reliable without having access to the actual data?

In this talk, we will share our experience of creating synthetic data to test data pipelines using Python. We will demonstrate how we used some statistical methods and Python packages such as Faker to generate realistic synthetic data for different use cases, such as customer profiles, transactions, and time series. We will also show how we used Flyway to load the synthetic data into a Postgres database and perform repeatable deployments. We will discuss the benefits and challenges of using synthetic data for testing data pipelines, as well as some best practices and tips for creating and using synthetic data effectively.
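
A minimal Faker sketch (the customer schema is a hypothetical example):

import pandas as pd
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible synthetic data

customers = pd.DataFrame({
    "customer_id": range(1, 6),
    "name": [fake.name() for _ in range(5)],
    "email": [fake.email() for _ in range(5)],
    "signup": [fake.date_time_between(start_date="-1y") for _ in range(5)],
})
print(customers)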

Data Track
Data Track
14:30
14:30
90min
Machine Learning Track
14:30
30min
Cloud UX for Data People
James Bourbeau

Cloud UX kinda sucks. It was written for cloud engineers who like very explicit systems, and always read the docs. This makes it a bad fit for data people (data scientists, data engineers, machine learning researchers) who rapidly learn and use several tools on a day-to-day basis. This mismatch in UX expectations results in poor utilization and wasted resources.

This talk goes through the challenges we faced when building a cloud UX for data people, and the kinds of solutions we ended up adopting when supporting Dask (parallel python) in a cloud environment.

General Track
General Track
14:30
30min
Data-Driven F&B Delivery: Jahez as a Leading Example
Basel Alebdi, Nouf Alroqi

The talk will discuss how the Data & AI department at Jahez applies advanced NLP techniques to in-app search queries to identify commercial opportunities, onboarding new restaurants that are in high demand among Jahez's customers.

Data Track
Data Track
15:00
15:00
60min
General Track
15:00
90min
Empowering Data Exploration: Creating Interactive, Animated Reports in Streamlit with ipyvizzu
Peter Vidos, Zachary Blackwood

Data scientists strive to bridge the gap between raw data and actionable insights. Yet, the actual value of data lies in its accessibility to non-data experts who can unlock its potential independently. Join us in this hands-on tutorial hosted by experts from Vizzu and Streamlit to discover how to transform data analysis into a dynamic, interactive experience.

Streamlit, celebrated for its user-friendly data app development platform, has recently integrated with Vizzu's ipyvizzu - an innovative open-source data visualization tool that emphasizes animation and storytelling. This collaboration empowers you to craft and share interactive, animated reports and dashboards that transcend traditional static presentations.

To maximize our learning time, please come prepared by following the setup steps listed at the end of the tutorial description, allowing us to focus solely on skill-building and progress.

Visualization Track
Visualization Track
15:00
60min
Keynote - Building and Productionizing RAG
Jerry Liu

Large Language Models (LLMs) are revolutionizing how users can search for, interact with, and generate new content. Recent stacks and toolkits around Retrieval-Augmented Generation (RAG) and agents enable users to build applications such as chatbots using LLMs on their private data. In this talk we do a comprehensive survey of both basic and advanced RAG techniques. We show you what RAG is and how to set up a simple version. We show you how to evaluate and optimize RAG systems. We then discuss advanced concepts (agents, fine-tuning) and help you think about how to build a full-stack LLM app.
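
As a flavor of how compact a basic RAG setup can be, here is a sketch with the 2023-era llama_index API, assuming documents in a local data/ folder and an OpenAI API key in the environment:

from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # your private files
index = VectorStoreIndex.from_documents(documents)     # chunk, embed and index them

query_engine = index.as_query_engine()
print(query_engine.query("What do these documents say about pricing?"))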

Keynote
Data Track
16:00
16:00
30min
Extremes, outliers, and GOATS: on life in a lognormal world
Allen Downey

The fastest runners are much faster than we expect from a Gaussian distribution, and the best chess players are much better. In almost every field of human endeavor, there are outliers who stand out even among the most talented people in the world. Where do they come from?

In this talk, I present as possible explanations two data-generating processes that yield lognormal distributions, and show that these models describe many real-world scenarios in natural and social sciences, engineering, and business. And I suggest methods -- using SciPy tools -- for identifying these distributions, estimating their parameters, and generating predictions.
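
A small sketch of the idea: multiplying many independent random factors yields a lognormal, which SciPy can then fit:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# A product of many small independent multiplicative effects is lognormal.
data = np.prod(rng.uniform(0.9, 1.2, size=(100_000, 30)), axis=1)

shape, loc, scale = stats.lognorm.fit(data, floc=0)  # fix the location at zero
print(f"sigma={shape:.3f}, median={scale:.3f}")
print("99.9th percentile:", stats.lognorm.ppf(0.999, shape, loc, scale))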

General Track
General Track
16:00
30min
How I used Polars to build functime, a next-gen ML forecasting library
Luca Baggi

Everybody knows Polars revolutionised the dataframe landscape, yet fewer realise that machine learning is next. Thanks to its extreme speed, we can speed up feature engineering by 1-2 orders of magnitude. The true gains, however, span across the whole ML lifecycle, with significantly faster batch inference and effortless scaling (no PySpark required!).

Add a best-in-class set of tools for feature extraction, model evaluation and diagnostic visualisations and you'll get functime: a next-generation library for ML forecasting. Though time-series practitioners are the primary audience, there's something for all data scientists. It's not just forecasting: it's about building the next generation of machine learning libraries.

Data Track
Data Track
16:00
90min
Improving Open Data Quality using Python
Cesar Garcia

In this session we will demonstrate how to measure and improve the quality of open data using the open source Python library Great Expectations. Attendees will learn quality testing techniques and methodologies to prepare high-quality longitudinal datasets using Open Data from cities and regional portals.
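
A minimal sketch using Great Expectations' long-standing pandas-flavored API (the air-quality columns are hypothetical):

import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"station_id": [1, 2, None], "pm25": [12.0, 300.0, 8.5]})
gdf = ge.from_pandas(df)

print(gdf.expect_column_values_to_not_be_null("station_id"))
print(gdf.expect_column_values_to_be_between("pm25", 0, 250))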

Machine Learning Track
Machine Learning Track
16:30
16:30
30min
Arrow revolution in pandas and Dask
Matthew Rocklin, Patrick Hoefler

The pandas library for data manipulation and data analysis is the most widely used open source data science software library. Dask is the natural extension for scaling pandas workloads to more than a single machine. The continuing integration and adoption of Apache Arrow alleviates historical bottlenecks in both libraries.
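
For instance, in pandas 2 you can opt into Arrow-backed dtypes directly (a minimal sketch):

import io

import pandas as pd

csv = io.StringIO("city,temp\nParis,12.5\nCairo,28.1\n")
df = pd.read_csv(csv, dtype_backend="pyarrow")  # Arrow-backed dtypes (pandas >= 2.0)

print(df.dtypes)  # e.g. string[pyarrow], double[pyarrow]
print(df["temp"].sum())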

Data Track
Data Track
16:30
30min
Understanding reactive execution in Shiny
Gordon Shotwell

Shiny for Python is a relatively new web application framework which uses transparent reactivity to build scalable web applications without code complexity. Shiny doesn't require you to write callbacks, but instead infers the relationships between components to minimally rerender them. This talk goes through the details of reactive programming to show why Shiny works, and how it can save you time and trouble.
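
A minimal Shiny for Python app showing the inferred dependency (no callbacks wired up by hand):

from shiny import App, render, ui

app_ui = ui.page_fluid(
    ui.input_slider("n", "Number of points", 1, 100, 20),
    ui.output_text("summary"),
)

def server(input, output, session):
    @output
    @render.text
    def summary():
        # Re-runs automatically whenever input.n changes: no callbacks registered.
        return f"You asked for {input.n()} points."

app = App(app_ui, server)  # run with: shiny run app.py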

Visualization Track
Visualization Track
17:00
17:00
90min
Building an Interactive Network Graph to Understand Communities
Lucas Durand

People are hard to understand, developers doubly so! In this tutorial, we will explore how communities form in organizations to develop a better solution than "The Org Chart". We will walk through using a few key Python libraries in the space, develop a toolkit for Clustering Attributed Graphs (more on that later) and build out an extensible interactive dashboard application that promises to take your legacy HR reporting structure to the next level.

Visualization Track
Visualization Track
17:00
30min
Data Tales from an Open Source Research Team
amanda casari, Sophia Vargas, María Cruz

Have you ever started a seemingly straightforward project which you assumed would "only take a little bit", to find yourself hours later with all the tabs open, closer to the gnarliness of the truth, but still far away from a simple answer? Are you curious why you can't find the data you need, if open source is so open? We've all been there, including teams with literally decades of professional data and analysis experience. In this talk, our team from the Google Open Source Programs office will share stories and hard-learned lessons from our work researching and analyzing data to more deeply understand open source.

Data Track
Data Track
17:00
30min
VocalPy: a core Python package for acoustic communication research
David Nicholson

Almost all animals communicate with sound, but as far as we know only humans speak languages. How did speech evolve? How do animals like birds, bats, and dolphins learn their songs, and is it similar to how we learn to speak? Questions like these are answered by the study of acoustic communication. This talk will get you acquainted with this exciting research. Along the way you'll hear many different animal sounds, and find out how researchers in this area are using neural network models. You'll learn why there is a need for a core package for researchers in this area (think AstroPy for astronomy). We will present a package we've developed to meet that need, VocalPy, and give a demo of the features. Then we'll present some results we've obtained with VocalPy on evaluating methods for segmenting audio into sequences of animal sounds. Finally we'll share our development roadmap, and tell you how you can get involved with the VocalPy community.

General Track
General Track
17:30
17:30
30min
API development for data analysts/scientists with FastAPI
Sara Iris Garcia

Get to know the basics of API development without having a software development background. As every data analyst/scientist, you will inevitably have to deal with APIs, either for downloading data or to expose your model for others to use.

In this talk, I will show you how easy it is to build your own API using FastAPI.
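
A minimal FastAPI sketch of the kind of endpoint the talk builds (the toy model is hypothetical):

from fastapi import FastAPI

app = FastAPI()

@app.get("/predict")
def predict(x: float):
    # Stand-in for a real model; FastAPI validates and converts ?x=... for you.
    return {"input": x, "prediction": 2 * x + 1}

# Run with: uvicorn main:app --reload   (then GET /predict?x=3.5)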

Data Track
Data Track
17:30
30min
But what is a Gaussian process? Regression while knowing how certain you are
Quan Nguyen

Given a test data point similar to the training points, we should expect the prediction of a machine learning model to be accurate.
However, we have no such guarantee for predictions on test points far away from the training data, and many models offer no quantification of this uncertainty in their predictions.
These models, including the increasingly popular neural networks, produce a single number as the prediction for a test point of interest, making it difficult to quantify how much trust the user should place in that prediction.

Gaussian processes (GPs) address this concern; as its prediction for a given test point, a GP outputs not a single number but a probability distribution representing the range into which the value we're predicting is likely to fall.
By looking at the mean of this distribution, we obtain the most likely predicted value; by inspecting the variance of the distribution, we can quantify how uncertain we are about this prediction.
This ability to produce well-calibrated uncertainty quantification gives GPs an edge in high-stakes machine learning use cases such as oil drilling, drug discovery, and product recommendation.

While GPs are widely used in academic research in Bayesian inference and active learning tasks, many ML practitioners still shy away from them, believing that a highly technical background is needed to understand and use GPs.
This talk aims to dispel that notion and offers a friendly introduction to GPs, including their fundamentals, how to implement them in Python, and common practices.
Data scientists and ML practitioners who are interested in uncertainty quantification and probabilistic ML will benefit from this talk.
While most background knowledge necessary to follow the talk will be covered, the audience should be familiar with common concepts in ML such as training data, predictive models, multivariate normal distributions, etc.
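
A minimal scikit-learn sketch of the uncertainty behavior described above (toy data):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[1.0], [3.0], [5.0], [6.0]])
y = np.sin(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(), random_state=0).fit(X, y)
X_new = np.array([[2.0], [10.0]])  # near vs. far from the training data
mean, std = gp.predict(X_new, return_std=True)
for x, m, s in zip(X_new.ravel(), mean, std):
    print(f"x={x:4.1f}  prediction={m:6.3f}  uncertainty={s:.3f}")  # std grows far away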

Machine Learning Track
Machine Learning Track
17:30
30min
Intake 2
Martin Durant

Intake is a Python library for describing, cataloging, finding and loading data. It has had the ethos of "load and get out of the way", which limited its scope but provided a lot of convenience. However, complexity built up over the years, creating a barrier for new users starting with Intake. In this talk, I will present Intake 2, a complete rewrite of the package, featuring a much simpler reader interface and the removal of many complex and unused features. This overhaul also enabled the development of a general-purpose data pipelining description, making Intake both simpler and much more powerful.

General Track
General Track
18:00
18:00
90min
Machine Learning Track
18:00
240min
PyMC / ArviZ / PyTensor Sprint
Christian Luhmann

Join sprint at https://numfocus-org.zoom.us/j/81746276652?pwd=bIh9dapxLFXutcSztKa5IYwloGMIr8.1

Sprint Leaders
Christian Luhmann
Purna Chandra Mansingh
Jesse Grabowski

Sprint
Sprints
18:00
30min
Real-Time Revolution: Kickstarting Your Journey in Streaming Data
Zander Matheson

Stream processing is hard! It's expensive! It's unnecessary! Batch is all you need! It's hard to maintain! While some of these may sound true, the world of streaming data has come a long way and it is time we start to take advantage of data in real-time.

This talk dips your feet into the world of streaming data and demystifies some of the common misconceptions. We will cover some of the basics around streaming data and how you can get started with your first stream processing project with the Python open source stream processor Bytewax.

Data Track
Data Track
18:00
30min
The Hell, According to a Data Scientist
Giuditta Parolini

The talk takes inspiration from a famous literary piece, Dante Alighieri's "Inferno" (Italian for "Hell"), to offer data scientists a moral revenge on the data sinners they constantly encounter in their professional life. While Dante populates his Hell with political enemies and even former Popes, I redraw the map of Dante's Inferno, finding a place and an adequate punishment for data sinners. With the help of the audience, I will make sure that creators of invalid CSV files, users of identifiers so unique that they are even longer than the recommended PEP 8 line length, and all other data sinners find their well-deserved place in Hell. The bottom line of the talk is that data scientists' lives will not improve until organisations begin to manage their data properly and realise that data products and infrastructures can be developed only when data satisfy minimal usability criteria, such as machine-readability.

General Track
General Track
18:30
18:30
60min
Visualization Track
18:30
30min
Blazing fast I/O of data in the cloud with Daft Dataframes
Jay Chia

Daft (www.getdaft.io) is an open-source distributed DataFrame library, written in Rust but with a Python API. It features blazing fast cloud storage I/O thanks to its Rust I/O layer, all accessible via a familiar Python DataFrame interface. Load tens of thousands of CSV and Parquet files in seconds, all from the comfort of Python!
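
A small sketch of the API (the bucket path and columns are hypothetical):

import daft

# Daft's Rust I/O layer fans the reads out across the matching files.
df = daft.read_parquet("s3://my-bucket/events/*.parquet")
df = df.where(df["status"] == "ok").select(df["user_id"], df["latency_ms"])
print(df.collect())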

Data Track
Data Track
18:30
30min
Order up! How do I deliver it? Build on-demand logistics apps with Python, OR-Tools, and DecisionOps
Ryan O'Neil

What models do you need to run an on-demand logistics operation? Whether you’re building apps for delivery, mobility, or ecommerce, these three decision models can get you started: forecasting, scheduling, and routing. In this talk, we’ll build, test, and deploy each model using Python and Google OR-Tools in a DecisionOps workflow. This talk is for data scientists and decision algorithm developers.
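
As a flavor of the routing piece, a minimal OR-Tools single-vehicle example over a hypothetical distance matrix:

from ortools.constraint_solver import pywrapcp, routing_enums_pb2

# Hypothetical 4-stop distance matrix (depot = stop 0).
dist = [
    [0, 9, 7, 4],
    [9, 0, 3, 6],
    [7, 3, 0, 5],
    [4, 6, 5, 0],
]

manager = pywrapcp.RoutingIndexManager(len(dist), 1, 0)  # nodes, vehicles, depot
routing = pywrapcp.RoutingModel(manager)

def distance_cb(from_index, to_index):
    return dist[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

transit = routing.RegisterTransitCallback(distance_cb)
routing.SetArcCostEvaluatorOfAllVehicles(transit)

params = pywrapcp.DefaultRoutingSearchParameters()
params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC

solution = routing.SolveWithParameters(params)
index = routing.Start(0)
route = []
while not routing.IsEnd(index):
    route.append(manager.IndexToNode(index))
    index = solution.Value(routing.NextVar(index))
route.append(manager.IndexToNode(index))
print(route)  # e.g. [0, 3, 1, 2, 0]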

General Track
General Track
19:00
19:00
30min
Getting better at Pokémon using data, Python, and ChatGPT.
Juan De Dios Santos

This talk covers data analysis techniques applied to the Pokémon Trading Card Game. From the statistical odds of drawing key cards to insights from 100 matches and a dashboard, I'll show how data, code, and ChatGPT improve my card game strategies.
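
For example, the odds of seeing a key card in your opening hand are hypergeometric (numbers below assume a standard 60-card deck and 7-card hand):

from scipy.stats import hypergeom

# 60-card deck, 4 copies of a key card, 7-card opening hand.
p_none = hypergeom.pmf(0, 60, 4, 7)
print(f"P(opening hand has the key card) = {1 - p_none:.1%}")  # ~39.9%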

General Track
General Track
19:00
30min
LanceDB: lightweight billion-scale vector search for multimodal AI
Chang She

With LanceDB you can make your laptop more powerful than any distributed vector database for semantic search. LanceDB is an open-source embedded vector database. It's lightweight like SQLite but powerful enough to deliver real-time semantic search over a billion vectors on a laptop.
LanceDB is backed by the Lance columnar format, which delivers up to 100x performance improvement over Parquet for managing multimodal AI data (e.g., vectors, images, point clouds, and more). With it, Lance gives AI teams a high-performance single source of truth across the whole AI life-cycle, from analytics to training to debugging.

In this talk we'll cover the use cases for production inference and in the data lake. We'll talk about the technical details of the Lance columnar format and what makes it different. And we'll show a demonstration of LanceDB for multi-modal semantic search.
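
A minimal LanceDB sketch (toy 2-d vectors standing in for real embeddings):

import lancedb

db = lancedb.connect("./lance-demo")  # embedded: just a local directory
table = db.create_table("docs", data=[
    {"vector": [0.9, 0.1], "text": "sunny beach holiday"},
    {"vector": [0.1, 0.9], "text": "snowy mountain trek"},
])

# Nearest-neighbour search; real vectors would come from an embedding model.
hits = table.search([0.8, 0.2]).limit(1).to_pandas()
print(hits["text"].iloc[0])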

Data Track
Data Track
19:30
19:30
90min
Build and deploy a Snowflake Native Application using Python
Gilberto Hernandez

As app developers, we’re accustomed to bringing the data in our data store directly to the custom APIs and UX/UI we build for our apps. But what if instead you could build an application in the same environment where the data lives? With Snowflake’s Native App Framework, you can build apps that run within Snowflake – right next to the data – using Python and Snowflake primitives. You can even monetize your app and drive revenue by distributing your app on the Snowflake Marketplace. In this session, Gilberto Hernandez, Lead Developer Advocate at Snowflake, will walk you step-by-step through building and deploying your first Snowflake Native App within Snowflake. To follow along in this lab, you’ll need:

  • A Snowflake account (create a free trial account at signup.snowflake.com – be sure to select AWS as the underlying cloud provider)

  • A code editor

Data Track
Data Track
11:00
11:00
60min
Machine Learning Track
11:00
30min
General Track
11:00
120min
All Them Data Engines: Pandas, Spark, Dask, Polars and more - Data Munging with Python circa 2023
Shaurya Agarwal

Versatility. / ˌvɜr səˈtɪl ɪ ti / noun: ability to adapt or be adapted to many different functions or activities.

Often our ecosystems limit us to one technology stack/framework/solution that we end up working on day-to-day. Maybe because the framework was chosen for us, maybe it's the one available at hand, maybe that's the skill most prevalent in the team, maybe it was chosen by following a decision analysis process, maybe other vagaries of the workplace were in play.

This is incredibly limiting in developing an intuition for problem solving, exploring the possibilities and simply being able to use the right tool for the right job.

In trying to gain experience on a new framework on our own, we are inundated with myriad concepts, jargon and "technical evangelism" so much that getting to the practical stuff often becomes an uphill battle for most of us.

This workshop aims to address this fundamental issue:
1. Get hands-on experience across some of the most in-demand data engineering frameworks around today - Pandas, Spark, Dask, Polars etc.
2. Focus on the one core thing - data munging - shaping data, analyzing it and deriving insights.

In this interactive 2-hour workshop, fellow data engineers will explore and gain practical experience with some of the industry's most sought-after data engineering frameworks. Through a series of engaging exercises and real-world-like examples, fellow attendees will be empowered to tackle data engineering challenges efficiently and effectively.
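
As a taste of the format, the same munging step expressed in two of the engines (recent Polars spells the method group_by):

import pandas as pd
import polars as pl

raw = {"shop": ["A", "A", "B", "B"], "sales": [10, 20, 30, 40]}

# The same aggregation, two engines:
print(pd.DataFrame(raw).groupby("shop", as_index=False)["sales"].sum())
print(pl.DataFrame(raw).group_by("shop").agg(pl.col("sales").sum()))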

Data Track
Data Track
11:30
11:30
30min
Python-Driven Portfolios: Bridging Theory and Practice for Efficient Investments
Kalyan Prasad

Discover how Python empowers the implementation of Modern Portfolio Theory (MPT) for constructing efficient investment portfolios. Explore risk assessment, asset allocation optimization, and the construction of high-return portfolios through practical applications
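
A minimal sketch of the optimization core, minimum variance for a target return, with hypothetical inputs:

import numpy as np
from scipy.optimize import minimize

mu = np.array([0.08, 0.12, 0.10])        # hypothetical expected annual returns
cov = np.array([[0.04, 0.01, 0.00],      # hypothetical covariance matrix
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.06]])
target = 0.10                            # required portfolio return

res = minimize(
    lambda w: w @ cov @ w,               # minimize portfolio variance
    x0=np.full(3, 1 / 3),
    bounds=[(0, 1)] * 3,                 # long-only
    constraints=[
        {"type": "eq", "fun": lambda w: w.sum() - 1},
        {"type": "eq", "fun": lambda w: w @ mu - target},
    ],
)
print("weights:", res.x.round(3), "volatility:", np.sqrt(res.fun).round(4))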

General Track
General Track
12:00
12:00
30min
FawltyDeps: Finding undeclared and unused dependencies in your notebooks and projects
Johan Herland

Reproducibility is a cornerstone of science. However, most data science projects and notebooks struggle at the most basic level of declaring dependencies correctly. A recent study showed that 42% of the notebooks executed failed due to missing dependencies.

FawltyDeps is a dependency checker that finds imports you forgot to declare (undeclared dependencies), and packages you declared, but that are not imported in your code (unused dependencies).

This talk will guide you through integrating FawltyDeps in your manual or automated workflows and how this can improve the reproducibility of your notebooks and projects.

General Track
General Track
12:00
30min
The State of Production Machine Learning in 2023
Alejandro Saucedo

As the number of production machine learning use-cases increase, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning, and we will cover the concepts that make production machine learning so challenging, as well as some of the recommended tools available to tackle these challenges.

Machine Learning Track
Machine Learning Track
12:30
12:30
30min
Enhancing your JupyterLab Developer Experience with Local LLMs and Code Snippets
Shivay Lamba

Join us for an insightful session dedicated to enhancing developer productivity through effective code snippet management in JupyterLab.

Topics covered include:
• Unlocking the full potential of Pieces for seamless organization and retrieval of code snippets.
• Crafting efficient and reusable code snippets.
• Utilizing code snippet libraries to expedite development cycles.
• Bridging the gap between code and documentation in JupyterLab.
• Tips on how to generate code specific to your project based on Copilot's on-device language model.

Machine Learning Track
Machine Learning Track
12:30
30min
The Internet's Best Experiment Yet
Avrahami

Reddit r/place was conceived as Reddit's 2017 April Fools' tongue-in-cheek experiment. A shared white canvas of a million pixels (1000 x 1000) appeared in a subreddit called "place". Redditors could change the color of a single pixel of their choosing. Once a Redditor changed a pixel, they were blocked by the system for a random interval (5-20 minutes), effectively preventing any single Redditor from having a significant influence on the canvas. The experiment, dubbed by Newsweek the Internet's best experiment yet, attracted 16.1M pixel changes performed by 1.2M unique users over 72 hours. While the expected result was total chaos, verging on white noise, the final state of the canvas contained an intricate collage of complex logos and artwork. In this talk, I present the experiment in detail, the data that were collected during the r/place experiment, and the research opportunities associated with this natural experiment. I introduce three research studies that make use of this unique dataset and setting. I share the machine-learning models we built as well as the insights gained using explainability tools, all using Python.

General Track
General Track
13:00
13:00
90min
An Introduction to Pandas 2, Polars, and DuckDB
Matt Harrison

Your choice among Pandas, Polars, and DuckDB can influence outcomes like productivity, integration, and velocity. This tutorial offers an introduction to three Python libraries: Pandas 2, Polars, and DuckDB. Attendees will not only get to understand the functionality of these libraries but also engage in hands-on experimentation.
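
As a taste, DuckDB can run SQL directly over an in-memory pandas DataFrame (a minimal sketch):

import duckdb
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", "Lima", "Oslo"], "sales": [10, 25, 5]})

# DuckDB can query the in-memory DataFrame directly by name.
print(duckdb.sql("SELECT city, SUM(sales) AS total FROM df GROUP BY city").df())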

Data Track
Data Track
13:00
30min
Get the best from your scikit-learn classifier: trusted probabilities and optimal binary decision
Guillaume Lemaitre

When operating a classifier in a production setting (i.e. the predictive phase), practitioners are potentially interested in two different outputs: a "hard" decision used to drive a business decision and/or a "soft" decision providing a confidence score linked to each potential decision (usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from "soft" to "hard" predictions: it uses a cut-off at a confidence score of 0.5 (or 0 when using decision_function) to get class labels. However, optimizing a classifier to produce confidence scores close to the true probabilities (i.e. a calibrated classifier) does not guarantee accurate "hard" predictions using this heuristic. Conversely, training a classifier for optimum "hard" prediction accuracy (with the cut-off constraint at 0.5) does not guarantee a calibrated classifier.

In this talk, we will present a new scikit-learn meta-estimator that gives us the best of both worlds: a calibrated classifier providing optimum "hard" predictions. This meta-estimator will land in a future version of scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding how to obtain accurate probabilities and predictions, and illustrate how to use this model in practice on different use cases: cost-sensitive problems and imbalanced classification problems.
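
The meta-estimator was not yet released at the time, but the underlying idea can be sketched by hand with stable scikit-learn APIs, sweeping the cut-off on a validation set:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Sweep cut-offs instead of assuming 0.5 is right for the business metric.
thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_val, proba >= t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(f"best cut-off={best:.2f}, F1={max(scores):.3f} "
      f"(vs {f1_score(y_val, proba >= 0.5):.3f} at 0.5)")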

Machine Learning Track
Machine Learning Track
13:00
30min
Xorbits Inference: Model Serving Made Easy
Jon Wang

In the rapidly evolving landscape of AI and machine learning, the deployment and serving of models have become as crucial as their development. Xinference, a state-of-the-art library, emerges as a game-changer in this domain, offering seamless model serving capabilities. This talk aims to delve deep into how Xinference not only simplifies the process of deploying language, speech recognition, and multimodal models but also intelligently manages hardware resources. By choosing an appropriate inference runtime based on the hardware and allocating models to devices according to their usage, Xinference ensures optimal performance and resource utilization.

General Track
General Track
13:30
13:30
30min
Introduction to Using Julia for Decentralization by a Quant
Martin Y. Xie

The Julia programming language has proven to be a solution to the two-language problem, especially in the area of scientific computing. However, being both easy and fast is just its "syntactic" feature and benefit. An extension or superset of Julia can unleash its "semantic" potential to provide value to every company going through digital transformation. We will discuss this in more detail with examples from the context of quantitative trading and hedge funds. We will also touch on Julia's potential in combination with technologies such as blockchain. We will release a new package as a first step towards an extension or superset of Julia for building decentralized systems.

General Track
General Track
13:30
30min
Unravelling Hidden Technical Debt in ML: A Pythonic Approach to Robust Systems
Ravi Singh

Explore the labyrinth of hidden technical debt in ML systems through the lens of a data scientist. Delve into six core challenges, illustrated by a churn prediction model case, and discover Python's prowess in navigating these challenges. Uncover Python tools like Docker, Flyte, Airflow, and Git that arm you against technical debt, leading to resilient ML infrastructure.

Machine Learning Track
Machine Learning Track
14:00
14:00
120min
General Track
14:00
30min
DDataflow: An open-source end-to-end testing framework for machine learning pipelines
Theodore Meynard, Jean Carlo Machado

In the realm of machine learning, the complexity of data pipelines often hinders rapid experimentation and iteration. This talk will introduce DDataflow, an innovative open-source tool, designed to facilitate end-to-end testing in ML pipelines by leveraging decentralized data sampling. Attendees will gain insights into the challenges of unit testing in large-scale data pipelines, the design philosophy behind DDataflow, and practical implementation strategies to enhance the reliability and efficiency of their ML pipelines.

Machine Learning Track
Machine Learning Track
14:30
14:30
30min
Event-Driven Data Science: Reconceptualizing Machine Learning for the Real-time World
Prema Roman, Patrick Deziel

Did you know that 87% of data science projects never make it into production? While open source libraries like scikit-learn and TensorFlow have gone a long way toward democratizing data science, they are also unintentionally limited by the assumptions and research focus of academia at the time they were released. One such assumption is that a model must be trained on batches of data and that all machine learning models need more data in order to perform well. This introduces a gap between training and inference, as there is a requirement to accumulate enough instances for training. For real-time use cases such as anomaly detectors, models can become stale even before they get deployed to production.

Fortunately there has been a trend towards building machine learning models that are geared towards learning from streams of data and that can react immediately to changes in data. This form of learning is usually referred to as real-time machine learning, online learning, or incremental learning.

In this talk, we will compare the two approaches to machine learning, provide a brief overview of River, a library for building online learning models, and demo a real-time application using PyEnsign, a real-time data streaming client.
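
A minimal River sketch of the learn-one/predict-one loop (toy features):

from river import linear_model, metrics

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

stream = [({"clicks": 3.0, "dwell": 1.2}, 1), ({"clicks": 0.0, "dwell": 0.1}, 0)]
for x, y in stream:                 # one instance at a time, no batches
    y_pred = model.predict_one(x)   # test-then-train evaluation
    metric.update(y, y_pred)
    model.learn_one(x, y)           # the model updates immediately

print(metric)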

Machine Learning Track
Machine Learning Track
14:30
30min
Pandas 2, Dask or Polars? Quickly tackling larger data on a single machine
Ian Ozsvald, Giles Weaver

Pandas 2 brings new Arrow data types, faster calculations and better scalability. Dask scales Pandas across cores and recently released a new "expressions" optimization for faster computations. Polars is a new competitor to Pandas designed around Arrow with native multicore support. Which should you choose for modern research workflows? We'll solve a "just about fits in ram" data task using the 3 solutions, talking about the pros and cons so you can make the best choice for your research workflow. You'll leave with a clear idea of whether Pandas 2, Dask or Polars is the tool to invest in and how Polars fits into the existing numpy-focused ecosystem.
Do you still need 5x working RAM for Pandas operations (probably not!)? Can Pandas string operations actually be fast (sure)? Since Polars uses Arrow data structures, can we easily use tools like Scikit-learn and matplotlib (yes-maybe)? What limits do we still face? Could you switch to experimenting with Polars and if so, what gains and issues might you face?

Data Track
Data Track
15:00
15:00
60min
Data Track
15:00
60min
Keynote - Federated Learning with Flower: AI's Next Frontier
Daniel Beutel

Federated learning, a transformative technique, not only overcomes data limitations and privacy challenges but also enhances the trustworthiness of machine learning. By moving computation to data sources, it ensures privacy while enabling collaborative model training on vastly more data than before. This keynote introduces federated learning, demonstrates how Python developers can implement it in under 20 lines of code using the Flower framework (https://flower.dev), and provides an outlook on how federated learning will shape the next generation of machine learning systems.
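
A rough sketch of the client side with the 2023-era Flower API (the toy "model" is just a NumPy array; a real client would train locally in fit, and a Flower server is assumed to be listening locally):

import flwr as fl
import numpy as np

class TinyClient(fl.client.NumPyClient):
    def __init__(self):
        self.weights = [np.zeros(3)]       # stand-in for real model weights

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        # Train on local, private data here; we just nudge the weights.
        self.weights = [parameters[0] + 0.1]
        return self.weights, 10, {}        # params, num examples, metrics

    def evaluate(self, parameters, config):
        return 0.5, 10, {}                 # loss, num examples, metrics

fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=TinyClient())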

Keynote
Machine Learning Track
16:00
16:00
120min
Data of an Unusual Size: A practical guide to analysis and interactive visualization of massive datasets
Pavithra Eswaramoorthy, Kim Pevey

While most scientists aren't at the scale of black hole imaging research teams that analyze Petabytes of data every day, you can easily fall into a situation where your laptop doesn't have quite enough power to do the analytics you need.

In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on powerful cloud machines provided by the presenters – starting from how the data is stored and read, to how it is processed and visualized.

Data Track
Data Track
16:00
90min
HPC in the cloud
Eddie

High performance computing has been a key tool for computational researchers for decades. More recently, cloud economics and the intense demand for running AI workloads has led to a convergence of older, established standards like MPI and a desire to run them on modern cloud frameworks like Kubernetes. In this tutorial, we will discuss the historical arc of massively parallel computation, focusing on how modern cloud frameworks like Kubernetes can both serve data scientists looking to build production-grade applications and run HPC-style jobs like MPI programs and distributed AI training. Moreover, we will show practical examples of submitting these jobs in a few lines of Python code.

General Track
General Track
16:00
90min
Who needs ChatGPT? Rock solid AI pipelines with Hugging Face and Kedro
Juan Luis Cano Rodríguez

Artificial Intelligence is all the rage, largely thanks to generative systems like ChatGPT, Midjourney, and the like. These commercial systems are very sophisticated and powerful, but also a bit opaque if you want to learn how they work or adapt them to your needs. What happens inside the 'black box'?

Luckily there are open AI models that you can download comfortably, study without restrictions, and adjust so that they do what you want. This requires some technical knowledge, but thanks to Hugging Face's models and their ecosystem of Python libraries, delving into AI is easier than ever.

You will soon find yourself combining different models, performing different tasks, and creating complex systems. But this complexity can grow very quickly, and soon you'll find yourself with spaghetti code if you are not careful. By using the Kedro catalog and Kedro pipelines, you will be able to organize the code in no time.
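
Getting a first Hugging Face model running takes only a few lines (a minimal sketch; the default model is downloaded from the Hub on first use):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Kedro pipelines keep my spaghetti code in check!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]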

Machine Learning Track
Machine Learning Track
17:30
17:30
300min
General Track
17:30
30min
Customizing and Evaluating LLMs, an Ops Perspective
Dean Pleban

With LLM hype growing ever greater, almost every company is racing to create their LLM application, whether it's an internal tool to boost productivity, or a chat interface for their product.

However, if your product or domain isn't fully generic, you'll probably hit a lot of challenges that make deploying your LLM application a meaningful risk.

In this talk, I'll discuss the main challenges in customizing and evaluating LLMs for specific domains and applications, and suggest a few workflows and tools to help solve those challenges.

Machine Learning Track
Machine Learning Track
18:00
18:00
30min
How can a learnt ML model unlearn something: Framework for "Machine Unlearning"
Saradindu Sengupta

With the recent explosion of large language and vision models, it has become inherently very costly to train models on new data. Coupled with that, the various new data privacy laws, introduced or soon to be introduced, make the "right to be forgotten" very costly and time-consuming to honor. In this talk, we will go through the current state of research on "machine unlearning" – how a trained model forgets something without retraining – and give a general demonstration of a machine unlearning framework.

Machine Learning Track
Machine Learning Track
18:00
30min
Optimize first, parallelize second: a better path to faster data processing
Itamar Turner-Trauring

You’re processing a large amount of data with Python, and your code is too slow.
One obvious way to get faster results is adding multithreading or multiprocessing, so you can use multiple CPU cores.
Unfortunately, switching straight to parallelism is almost always premature, often unnecessary, and sometimes impossible.
We'll cover the different goals for performance, why parallelism only achieves one of them, the costs of parallelism, and the alternative: speeding up your code first.
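
A minimal illustration of the point: vectorizing a hot loop often beats parallelizing it, at zero coordination cost:

import time

import numpy as np

values = np.random.default_rng(0).random(10_000_000)

t0 = time.perf_counter()
total = 0.0
for v in values:                             # interpreted, one element at a time
    total += v * v
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
total_vec = float((values * values).sum())   # one vectorized pass in C
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.2f}s  vectorized: {t_vec:.3f}s  ({t_loop / t_vec:.0f}x)")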

Data Track
Data Track
18:00
240min
PyMC / ArviZ / PyTensor Sprint
Christian Luhmann

Join sprint at https://numfocus-org.zoom.us/j/86009093429?pwd=lz9sX0Cwu6gbCz5fGdqYwvQQi1RKhF.1

Sprint Leaders
Christian Luhmann
Purna Chandra Mansingh
Jesse Grabowski

Sprint
Sprints
18:30
18:30
30min
Maximize GPU Utilization for Model Training
Lu Qiu

When training models on large datasets, one of the biggest challenges is low GPU utilization. These powerful processors are often underutilized due to inefficient I/O and slow data loading. This mismatch between computation and storage leads to wasted GPU resources, low performance, and high cloud storage costs. The rise of generative AI and GPU scarcity is only making this problem worse.

In this session, Lu will discuss strategies for maximizing GPU utilization by using the open-source stack of PyTorch+Alluxio+S3.
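
One common single-machine lever, independent of the caching stack the talk presents, is tuning PyTorch's DataLoader so loading overlaps with compute (a sketch with synthetic data):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(2_048, 3, 64, 64), torch.randint(0, 10, (2_048,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=4,      # overlap data loading with GPU compute
    pin_memory=True,    # faster host-to-device copies
    prefetch_factor=2,  # batches each worker keeps ready in advance
)

for xb, yb in loader:
    pass  # the forward/backward pass would go here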

Machine Learning Track
Machine Learning Track
18:30
30min
Xclim: Climate Data Processing and Analysis for Everyone
Trevor James Smith, Pascal Bourgault

Climate change projections and analyses are among the many processes that require not only sound scientific approaches, but also scalable and efficient algorithms, due to the data-intensive nature of climate science.

Xclim is a cutting-edge climate analysis library built using xarray and dask to solve real problems in climate change analysis and processing, offering tools such as climate model ensemble selection and bias adjustment, climate data health check-ups, in addition to the ability to calculate more than 150 relevant climate indicators over enormous databases.

Developed with user-friendliness in mind, Xclim serves as the backbone of Environment and Climate Change Canada's ClimateData.ca platform.

Join us to explore Xclim's capabilities and follow a typical workflow, transforming vast climate datasets into actionable climate insights.
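
A minimal sketch of computing one indicator (the input file and variable name are hypothetical; signature as in recent xclim):

import xarray as xr
import xclim

# Hypothetical file holding daily maximum temperature as "tasmax".
ds = xr.open_dataset("daily_tasmax.nc")

# Annual count of days above 30 degC, one of xclim's 150+ indicators.
hot_days = xclim.atmos.tx_days_above(tasmax=ds.tasmax, thresh="30.0 degC", freq="YS")
print(hot_days)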

Data Track
Data Track
19:00
19:00
120min
From raw data to interactive data app in an hour: Powered by Python
Vino Duraisamy

As data practitioners, we often rely on the data engineering teams upstream to deliver the right data needed to train ML models at scale. Deploying these ML models as a data application to downstream business users is constrained by one’s web development experience. Using Snowpark, you can build end to end data pipelines, and data applications from scratch using Python.
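
A minimal Snowpark sketch (placeholder credentials; the table and columns are hypothetical):

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Placeholder credentials: fill in your own account details.
session = Session.builder.configs({
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<db>", "schema": "<schema>",
}).create()

orders = session.table("ORDERS")  # hypothetical table
shipped = orders.filter(col("STATUS") == "SHIPPED").group_by("REGION").count()
shipped.show()  # the whole pipeline is pushed down and runs inside Snowflake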

Data Track
Data Track
19:00
30min
Real Time Machine Learning
Oren Netzer, Eitan Netzer

We live in a real time world, where information and consumer preferences can change multiple times per day. This requires machine learning algorithms that can be trained and updated frequently and cost effectively. This talk will demonstrate how data scientists can use new frameworks to develop ML models that can be easily updated with new data, without requiring retraining on the full dataset.

Machine Learning Track
Machine Learning Track
19:30
19:30
30min
Tricking Neural Networks: Explore Adversarial Attacks
Bernice Waweru

Large Language Models are pretty cool, but we need to be aware of how they can be compromised.
I will show how neural networks are vulnerable to attacks through an example of an adversarial attack on deep learning models in Natural Language Processing (NLP).
We’ll explore the mechanisms used to attack models, and you’ll get a new way to think about the security of deep learning models.
An understanding of deep learning is required.

Machine Learning Track
Machine Learning Track
20:00
20:00
30min
Bridging Classic ML Pipelines with the World of LLMs
Elijah ben Izzy, Stefan Krawczyk

You probably don’t need a fancy new tool to take advantage of LLMs. While the explosion of inventive AI applications feels like a massive leap forward, the core challenges in plugging them into the business represent an incremental step from the discipline of MLOps.

The challenges are largely equivalent. Retrieval augmented generation is effectively a recommendation system. Agents are the control flow of your program. Chains of LLM calls are simple DAGs. And you're still stuck trying to monitor quantitatively unclear predictions, wrestle expensive, unstable APIs into submission, and build out and manage complex dataflows.

The toolbox, as well, remains similar. In this talk we present the library Hamilton, an open source microframework for expressing dataflows in python. We show how it can help you build observable, stable, context-independent pipelines that span the gamut from classical ML to LLMs/RAG, enabling you to maintain sanity and keep up with the pace of change as everyone steps into the fascinating new world of AI.
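
A minimal Hamilton sketch, shown as the two files it would normally live in:

# features.py -- each function is a node; parameter names wire up the DAG.
import pandas as pd

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    return spend / signups

def spend_zero_mean(spend: pd.Series) -> pd.Series:
    return spend - spend.mean()

# run.py
import pandas as pd
from hamilton import driver
import features

dr = driver.Driver({}, features)
inputs = {"spend": pd.Series([10.0, 20.0]), "signups": pd.Series([1, 4])}
print(dr.execute(["spend_per_signup", "spend_zero_mean"], inputs=inputs))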

Machine Learning Track
Machine Learning Track
20:30
20:30
30min
Compute anything with Metaflow
Ville Tuulos

Over the past years, the compute landscape has become much more fragmented and heterogeneous: GenAI needs access to various types of GPUs, sometimes leveraging vertical scalability, sometimes horizontal. The demand for CPU-based compute has become more diverse as well, as vertically scaling, high-performance data engines like Arrow and DuckDB have reduced the need for inefficient approaches based on horizontal scaling. On top of this, the competition among clouds and specialized compute providers is getting more intense, motivated by customer demands for cost-efficiency.

Since its inception, Metaflow, which was originally open-sourced by Netflix in 2019, has been built to address diverse compute needs. Instead of proposing a new universal compute paradigm like Spark, which requires bespoke libraries, Metaflow integrates with various compute substrates and providers, including all major clouds. Recently, Metaflow gained support for large-scale distributed workloads, including distributed training on large GPU clusters.

In this talk, we give an overview of the changing landscape for compute and describe how open-source Metaflow allows Python developers to leverage various compute platforms easily.
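
A minimal Metaflow flow, for flavor (compute-targeting decorators omitted):

from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):

    @step
    def start(self):
        self.numbers = [1, 2, 3]
        self.next(self.square)

    @step
    def square(self):
        # Decorators like @batch, @kubernetes or @resources can send this
        # step to a different compute backend without code changes.
        self.squares = [n * n for n in self.numbers]
        self.next(self.end)

    @step
    def end(self):
        print(self.squares)

if __name__ == "__main__":
    HelloFlow()  # run with: python hello_flow.py run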

Machine Learning Track
Machine Learning Track
21:00
21:00
90min
Data Track
21:00
90min
Full-stack Machine Learning and Generative AI for Data Scientists
Hugo Bowne-Anderson

One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how to move machine learning projects from prototype and experiment to production as a repeatable process. In this tutorial, we present an introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows. We'll cover a wide range of applications, including the business-critical ML and data pipelines of today, as well as state-of-the-art generative AI and LLM use cases of tomorrow.

Machine Learning Track
Machine Learning Track
09:55
09:55
185min
Data Track
09:55
185min
General Track
10:00
10:00
300min
Machine Learning Track
10:00
690min
Visualization Track
10:00
90min
Building Contextual ChatBot using LLMs, Vector Databases and Python
Nabanita Roy

Chatbots that understand context and respond based on past conversations are a "dream come true" with state-of-the-art Generative AI models. In this tutorial, I will demonstrate building a chatbot using the OpenAI API and LLMs available on Hugging Face. I will also talk about the advantages of using LangChain and the different strategies that can be used to configure your chatbot to yield the best responses. Not only that: the chatbot can also surface the relevant texts (the context) from which it derives its answers, for transparency, validation and troubleshooting. Python libraries like OpenAI, HuggingFace, LangChain and Streamlit will be used for the majority of the tutorial to build this GenAI-powered chatbot.
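
A rough sketch of the retrieval-with-sources pattern using the 2023-era LangChain API (assumes an OpenAI API key in the environment and faiss-cpu installed; the toy texts are hypothetical):

from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

texts = ["Our refund window is 30 days.", "Support is open Mon-Fri, 9-17 CET."]
store = FAISS.from_texts(texts, OpenAIEmbeddings())  # the "context" database

chain = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(), retriever=store.as_retriever(), return_source_documents=True
)
result = chain({"question": "Can I get a refund after 3 weeks?", "chat_history": []})
print(result["answer"])
print(result["source_documents"])  # the retrieved context, for transparency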

LLMs Track
LLM Track
11:30
11:30
30min
Accelerating fuzzy document deduplication to improve LLM training with RAPIDS and Dask
Jacob Tomlinson

Training Large Language Models (LLMs) requires a vast amount of input data, and the higher the quality of that data the better the model will be at producing useful natural language. NVIDIA NeMo Data Curator is a toolkit built with RAPIDS and Dask for extracting, cleaning, filtering and deduplicating training data for LLMs.

In this session, we will zoom in on one element of LLM pretraining and explore how we can scale out fuzzy deduplication of many terabytes of documents. We can run a distributed Jaccard similarity workload by deploying a RAPIDS accelerated Dask cluster on Kubernetes to remove duplicate documents from our training set.
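
The underlying similarity measure is easy to state in pure Python (a conceptual sketch only; NeMo Data Curator does this at terabyte scale on GPUs):

def shingles(text: str, n: int = 3) -> set:
    """Character n-grams; fuzzy dedup compares documents by shingle overlap."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

print(jaccard("the quick brown fox", "the quick brown foxes"))  # ~0.89: near-duplicates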

LLMs Track
LLM Track
12:00
12:00
30min
LLMs: Beyond the Hype - A Practical Journey to Scale
Shashank Shekhar

The landscape of Large Language Models (LLMs) has expanded rapidly, offering users a diverse range of options for text generation and analysis. However, the cost associated with utilizing these LLMs can turn out to be very expensive. During this presentation, I will delve into practical strategies aimed at achieving a delicate balance: reducing inference costs while simultaneously elevating model performance, enhancing quality, and optimizing latency. Additionally, I will discuss essential architectural principles for constructing LLM-based systems and products, alongside pragmatic methodologies to fine-tune open-source LLM models, enhancing their performance in specific use-cases. I will also explore some practical evaluation methods for benchmarking models against baseline standards, delve into embedding techniques for precise query classification, and unravel the intricacies of shot-prompting strategies to bolster adaptability to unfamiliar data.

LLM Track
LLM Track
12:30
12:30
30min
Productionizing Open Source LLMs
Sean Sheng

Open source large language models (LLMs) are now inching towards matching the proficiency of proprietary models, such as GPT-4. In addition, operating your own LLMs can unveil advantages in aspects like data privacy, model customizability, and cost efficiency. However, running your own LLMs and realizing these benefits in a production environment is not easy - it necessitates a precise set of optimizations and a robust infrastructure. Come to this talk to learn about the problems you might face when using your own large language models, and find out how OpenLLM can help you solve them.

LLM Track
LLM Track
13:00
13:00
120min
Architecting Data Tools: A Roadmap for Turning Theory and Data Projects into Python Packages
Ramon Perez

The goal of this workshop is to address the gap between the development of technical work -- whether that's via research or more traditional data science work -- and its reproducibility by providing attendees with the necessary knowledge to get started creating Python packages. This means that, if you're a researcher (with basic Python knowledge) wanting to make your theories more accessible via code, or a data professional wanting to share your Python code inside or outside of your organization, this workshop will help you understand how to contribute to, and develop, open-source projects from scratch.

General Track
General Track
13:00
30min
High speed data from the Lakehouse to DataFrames with Apache Arrow
Jim Dowling

In 2023, with the introduction of pandas 2.0, Apache Arrow became the dominant standard for both the in-memory representation and the over-the-wire transfer format for data in DataFrames.
In this talk, we will examine the performance benefits of using Apache Arrow end-to-end from the data lake or warehouse to client-side DataFrames. We will demonstrate in Python examples how data can now be moved between pandas 2.0, Polars, and DuckDB at no cost (zero-copy), and we will look at how Arrow enables the replacement of row-oriented APIs for data retrieval (JDBC/ODBC) with column-oriented protocols (Arrow Flight and ADBC). We will show how we built a query service that bridges the data lake with Python clients. DataFrame clients can read data using a network-hosted service that reads Arrow data from Parquet files, processes the data in Arrow format, and transfers the data to clients using the Arrow Flight service. We will also look toward a file-free future for DataFrames, where they can be easily stored and updated in a serverless platform.
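
A small illustration of these zero-copy hand-offs (a sketch assuming pandas 2.x, PyArrow, Polars, and DuckDB are installed; the dtype spelling and variable names are just one way to do it):

    import duckdb
    import pandas as pd
    import polars as pl
    import pyarrow as pa

    # pandas 2.x can hold data in Arrow memory via Arrow-backed dtypes
    pdf = pd.DataFrame({"x": [1, 2, 3]}, dtype="int64[pyarrow]")

    tbl = pa.Table.from_pandas(pdf)  # pandas -> Arrow
    pldf = pl.from_arrow(tbl)        # Arrow -> Polars, zero-copy for most types
    total = duckdb.sql("SELECT SUM(x) AS total FROM tbl").fetchall()  # DuckDB scans the Arrow table in place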

Data Track
Data Track
13:00
30min
Leveraging open-source LLMs for production
Andrey Cheptsov

This talk examines using open-source LLMs for real-world purposes. It compares the benefits and drawbacks of open-source LLMs to proprietary options like OpenAI. The discussion covers the economics of hosting open-source LLMs, highlights serving frameworks, explores cloud GPU availability, and gives an overview of key open-source LLMs.

LLM Track
LLM Track
13:00
180min
Naas Sprint
Jérémy Ravenel

Join the sprint at https://numfocus-org.zoom.us/j/88237670803?pwd=vSKWQ3FULy7ufuXQgWOK3OO0pyRhhC.1

Sprint Leader
Jeremy Ravenel (https://github.com/jravenel)

Sprint
Sprints
13:30
13:30
30min
Build AI-powered data pipeline without vector databases
Bobur Umurzokov

This presentation explores the challenges, such as cost, latency, and security, faced when developing a new Large Language Model (LLM) app, and presents solutions to these obstacles. You will learn how to build your own AI-enabled real-time data pipeline without complex and fragmented typical LLM stacks such as vector databases, frameworks, or caches. We will leverage an open-source LLM App library in Python to implement real-time in-memory data indexing that reads data directly from any compatible storage, then processes, analyzes, and sends it to output streams.

Data Track
Data Track
13:30
30min
From RAGs to riches: Build an AI document interrogation app in 30 mins
Philip Meier

As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document interrogation systems have emerged as a high value practical use case. The ability to ask natural language questions and get relevant answers from a large corpus of documents has the potential to fundamentally transform organizations and make institutional knowledge accessible.

Retrieval-augmented generation (RAG) is a technique to make foundational LLMs more powerful and accurate, and a leading way to implement a personal or company-level chat-based document interrogation system. In this talk, we’ll understand RAG by creating a personal chat application. We’ll use a new OSS project called Ragna that provides a friendly Python and REST API, designed for this particular case. We’ll also demonstrate a web application, built with Panel (a powerful OSS Python application development framework), that leverages the REST API.

By the end of this talk, you will have an understanding of the fundamental components that form a RAG model as well as exposure to open source tools that can help you or your organization explore and build on your own applications.
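
The retrieve-then-generate shape of a RAG system fits in a few lines. The sketch below is a deliberately toy illustration of that structure, not Ragna's API: the embed() function is a random stand-in for a real sentence-embedding model:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Stand-in embedding; a real system calls an embedding model here."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(384)

    docs = ["Our invoicing policy...", "Our onboarding checklist..."]
    doc_vecs = np.stack([embed(d) for d in docs])

    def retrieve(question: str, k: int = 1) -> list[str]:
        q = embed(question)
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        return [docs[i] for i in np.argsort(sims)[::-1][:k]]

    context = retrieve("How do I onboard a new hire?")
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
    # The assembled prompt is then sent to the LLM, and the retrieved
    # context is returned alongside the answer for transparency.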

LLM Track
LLM Track
14:00
14:00
30min
Production Data to the Model: “Are You Getting My Drift?”
Gatha Varma

A shift is a poetic word for uncertainty. Winds shift, rivers and sands drift, and people change. Coming to the not-so-poetic world of data science, what about the data? Data comes from systems and the people using them, so it is natural that data will see the rigors of shift too. A model that was trained and tested for particular dynamics may account for expected uncertainty in the data, such as a shift in user behavior. But what happens when the shift goes beyond expectations? How do teams detect the different types of data drift? More so, how do they tackle the detected drift? In this talk, I will gently introduce you to data drift and how the industry tackles this issue.
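
As one concrete detection recipe among those the talk surveys, a two-sample Kolmogorov-Smirnov test compares a feature's training distribution against production data (a sketch with synthetic data; the threshold is an illustrative choice):

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    train = rng.normal(0.0, 1.0, 5_000)  # feature as seen at training time
    live = rng.normal(0.4, 1.0, 5_000)   # same feature in production; mean has drifted

    stat, p_value = ks_2samp(train, live)
    if p_value < 0.01:
        print(f"Drift detected (KS statistic = {stat:.3f})")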

Data Track
Data Track
14:00
30min
Training large scale models using PyTorch
Shagun Sodhani

Learn about the different approaches for training large-scale machine learning models using PyTorch.
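
For orientation, the most common starting point is DistributedDataParallel; the sketch below shows the standard recipe (the model and optimizer are placeholders, and the script is assumed to be launched with torchrun):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # Launch with: torchrun --nproc_per_node=<num_gpus> train.py
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(1024, 1024).cuda()       # placeholder model
        ddp_model = DDP(model, device_ids=[local_rank])  # gradients all-reduced across ranks
        optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
        # ...training loop as usual; each rank consumes its own data shard...
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()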

LLM Track
LLM Track
14:30
14:30
270min
LLM Track
14:30
30min
cudf.pandas: The Zero Code Change GPU Accelerator for Pandas
Ashwin Srinath

Pandas is loved and venerated for its flexibility and ease-of-use. However, its oft-quoted slowness has prompted projects like DuckDB, Polars, and RAPIDS cuDF to step in and offer faster alternatives. These are all fantastic tools, but they have non-zero adoption costs, more restrictive APIs compared to pandas, and they don’t always work with 3rd party libraries that use pandas today.

cudf.pandas takes a fresh approach: instead of trying to be a replacement for pandas, it effectively accelerates pandas on the GPU. cudf.pandas requires no code changes (not even your pandas imports!), supports 100% of the pandas API, and third-party libraries that use pandas are magically accelerated on the GPU.

If you use pandas today and want to run your code on the GPU with 0 changes today, this talk is for you. If you are the maintainer of a library that uses pandas and you’d like to support GPUs with 0 changes today, this talk is for you. If you’re a Pythonista at heart and enjoy hearing about the proxy pattern and deep import customization, this talk is for you!
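
The documented usage really is this small:

    %load_ext cudf.pandas   # in Jupyter, before the first pandas import
    import pandas as pd     # unchanged code; supported operations now run on the GPU

    # For scripts, no source changes at all:
    #   python -m cudf.pandas my_script.py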

Data Track
Data Track
15:00
15:00
60min
Data Track
15:00
60min
General Track
15:00
60min
Keynote - Building Machine Learning Apps in Python with Gradio
Abubakar Abid

In this talk, we will cover practical tools for modern machine learning: datasets, models, and demos. First, we will talk about how to use the Hugging Face Hub, covering how to easily find the right models and datasets for your machine learning tasks. Then, we will walk through building and sharing ML demos, covering how to quickly demo ML models for class presentations, portfolios, etc. using the Gradio (www.gradio.dev) library.
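
As a flavor of how little code a Gradio demo needs (the function here is a trivial stand-in for a real model):

    import gradio as gr

    def classify(text: str) -> dict:
        # Stand-in for a real model call
        return {"positive": 0.7, "negative": 0.3}

    gr.Interface(fn=classify, inputs="text", outputs="label").launch()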

Keynote
Machine Learning Track
16:00
16:00
30min
Ensuring Runtime Reproducibility in the Python Ecosystem
Pavithra Eswaramoorthy, Jaime Rodríguez-Guerra

The Python packaging ecosystem has a massive and diverse user community with various needs. A subset of this user base, the data science and scientific computing communities (i.e., the PyData communities), has historically relied on the conda package and environment management tools for their workflows. conda has robust solutions for packaging and distributing libraries and managing dependencies in environments, but there are still unsolved challenges for reliably reproducing runtime environments. For instance, compute-intensive R&D activities require certain reproducibility guarantees for collaborative development and to ensure production-level tools' stability and integrity. Many teams lack proper documentation and dependable practices for installing and regenerating the same runtime conditions across their software pipelines and systems, leading to product instability and delays in release and production.

In this talk, we will:
* Share reproducibility best practices for Python-based data science workflows. For this, we will present real-world examples where reproducibility was not a core requirement or consideration of the project but was introduced as an afterthought.
* Demonstrate a greenfield solution to this problem: conda-store, an open source project that ensures flexible yet reproducible environments with features like version control, role-based access control, and background enforcement of best practices, all the while providing a user-friendly interface.

You will learn about all the variables that affect runtime conditions (like enumerating project dependencies and technical details about your operating system and hardware). We will also present a checklist of automated tasks that should be part of a reproducible workflow and the different packaging solutions in the PyData ecosystem with a deeper focus on conda-store. We hope to share the perspective of a downstream user of the packaging ecosystem and bring attention to the conversations around runtime-environment reproducibility.

General Track
General Track
16:00
90min
Predictive survival analysis with scikit-learn, scikit-survival and lifelines
Olivier Grisel

This tutorial will introduce how to train machine learning models for time-to-event prediction tasks (health care, predictive maintenance, marketing, insurance...) without introducing a bias from censored training (and evaluation) data.
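
As a baseline before any machine learning, the classic Kaplan-Meier estimator from lifelines handles censoring directly (toy data; the tutorial goes well beyond this into predictive models):

    import numpy as np
    from lifelines import KaplanMeierFitter

    durations = np.array([5.0, 6.0, 6.0, 2.5, 4.0, 4.0])  # time to event or censoring
    observed = np.array([1, 0, 0, 1, 1, 1])               # 1 = event observed, 0 = censored

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)  # censored rows inform, not bias, the fit
    print(kmf.survival_function_)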

Machine Learning Track
Machine Learning Track
16:00
30min
Unified batch and stream processing in python
Arthur Andres

Historically, it's been difficult to reuse existing batch processing code in streaming applications.
Because of this, ML engineers had to maintain two implementations of their jobs: one for streaming and one for batch.

In this talk we'll introduce beavers, a stream processing library optimized for analytics.
It can be used to run both batch and streaming jobs with minimal code duplication, whilst still being good at both.

Data Track
Data Track
16:30
16:30
30min
Data Harvest: Unlocking Insights with Python Web Scraping
Yuliia Barabash

In today's data-driven world, knowing how to gather and analyze information is more critical than ever. Join us for a compact session on using Python and Scrapy to crawl the web and solve real-time problems. We'll cover the basics, and then dive into a practical example of collecting apartment data from the internet.
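
A minimal Scrapy spider has roughly this shape (the URL and CSS selectors are hypothetical stand-ins for a real listings site):

    import scrapy

    class ApartmentSpider(scrapy.Spider):
        name = "apartments"
        start_urls = ["https://example.com/listings"]  # hypothetical

        def parse(self, response):
            for listing in response.css("div.listing"):  # hypothetical selectors
                yield {
                    "price": listing.css(".price::text").get(),
                    "rooms": listing.css(".rooms::text").get(),
                }
            next_page = response.css("a.next::attr(href)").get()
            if next_page:  # follow pagination
                yield response.follow(next_page, callback=self.parse)

Running it with scrapy runspider spider.py -o apartments.json dumps the collected items to JSON.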

Data Track
Data Track
16:30
30min
Prefect Workflows for Scaling Acoustic Fisheries Survey Pipelines
Soham Butala

How many fish are in the ocean? To answer this efficiently, we attempt to modernize fisheries operations to support interoperable and scalable sonar data processing by building user-friendly customizable Prefect workflows. We share our story to inform others considering ways to provide modern orchestration tools to users without a lot of technical experience.
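
In Prefect 2.x terms, wrapping existing processing code in an orchestrated flow takes about this much ceremony (the task body is a placeholder for real sonar processing):

    from prefect import flow, task

    @task(retries=2)
    def process_sonar_file(path: str) -> str:
        # Placeholder: calibrate and convert one raw sonar file
        return f"processed:{path}"

    @flow(log_prints=True)
    def survey_pipeline(paths: list[str]):
        for path in paths:
            print(process_sonar_file(path))  # retries, logging, and state tracking come for free

    if __name__ == "__main__":
        survey_pipeline(["file_001.raw", "file_002.raw"])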

General Track
General Track
17:00
17:00
30min
Collaborate with your team using data science notebooks
Megan Lieu

In the spirit of constructive chaos, this talk will cover data democratization - why it's important, what it means for organizations, and what's needed to make it happen.

General Track
General Track
17:00
30min
Data persistence with consistency and performance in a truly serverless system
William Dealtry

Fully serverless systems are compelling for a number of reasons; they are inherently scalable, highly available, and have a low maintenance burden. The challenge with a serverless system is providing sufficiently strong guarantees of data consistency without either sacrificing performance or simply shifting the burden of maintaining consistency to an external client-server system. At ArcticDB (https://github.com/man-group/arcticdb) we have spent years refining a fully serverless model that pushes the boundaries of what can be achieved with nothing but a Python library and commodity object storage. In this talk we will share re-usable techniques for ensuring data reliability without external synchronization.
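
The basic workflow is a handful of lines; a sketch following ArcticDB's documented API, with a local LMDB URI standing in for real object storage such as S3:

    import pandas as pd
    from arcticdb import Arctic

    ac = Arctic("lmdb://./arcticdb_demo")  # swap in an "s3://..." URI for object storage
    lib = ac.get_library("prices", create_if_missing=True)

    df = pd.DataFrame({"close": [101.2, 102.5]},
                      index=pd.date_range("2023-01-01", periods=2))
    lib.write("ACME", df)         # versioned, consistent write
    print(lib.read("ACME").data)  # read back the latest version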

Data Track
Data Track
17:30
17:30
30min
IID Got You Down? Resample Time Series Like A Pro
Sankalp Gilda

Unlock robust statistical inference for time series data with tsbootstrap, a new open source Python library implementing specialized bootstrapping techniques.
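
The core idea, resampling contiguous blocks rather than individual points so that autocorrelation survives, fits in a few lines of NumPy. This generic sketch illustrates the technique only and is not tsbootstrap's API:

    import numpy as np

    def moving_block_bootstrap(x: np.ndarray, block_size: int,
                               rng: np.random.Generator | None = None) -> np.ndarray:
        """Stitch random contiguous blocks together, preserving the
        short-range dependence that IID resampling destroys."""
        rng = rng or np.random.default_rng()
        n = len(x)
        n_blocks = int(np.ceil(n / block_size))
        starts = rng.integers(0, n - block_size + 1, size=n_blocks)
        return np.concatenate([x[s:s + block_size] for s in starts])[:n]

    resampled = moving_block_bootstrap(np.sin(np.arange(200) / 10), block_size=20)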

Data Track
Data Track
17:30
30min
Python as a Hackable Language for Interactive Data Science
Stephen Macke

Did you know that the core Python syntax and semantics can be tailored for interactive computing use cases? It turns out that more is possible than what you would expect! For example, at the most basic level, Jupyter supports basic syntax extensions like so-called "magic" operations. It turns out, however, that one can go much deeper. In this talk, I'll show that it's possible to augment and abuse Python to support a plethora of interactive use cases. I'll start with the simple example of building an optional chainer for Python (supporting JavaScript-like syntax such as a?.b()?.c). I'll then show how to use these same ideas to accelerate data science operations, concluding with an example of how to perform full dataflow tracking in order to give users the illusion of dataframe queries that run instantaneously.
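
To give a flavor of the trickery involved, here is a tiny userland optional-chaining proxy; a plain-Python approximation of the idea, not the deeper syntax-level machinery the talk demonstrates:

    class Maybe:
        """Attribute access and calls short-circuit on None,
        approximating JavaScript's a?.b()?.c in plain Python."""

        def __init__(self, value):
            self._value = value

        def __getattr__(self, name):
            if self._value is None:
                return self
            return Maybe(getattr(self._value, name, None))

        def __call__(self, *args, **kwargs):
            if self._value is None or not callable(self._value):
                return Maybe(None)
            return Maybe(self._value(*args, **kwargs))

        def unwrap(self):
            return self._value

    print(Maybe("hello").upper().unwrap())  # "HELLO"
    print(Maybe(None).upper().unwrap())     # None instead of AttributeError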

General Track
General Track
17:30
30min
sktime – the saga. Trials and tribulations of a charitable, openly governed open source project
Franz Kiraly

Are you a member or leader of an open source community with open governance and a charitable mission?

Sit down and listen. Listen how to grow, nurture, and protect your community. Watch while it grows, takes off, and spreads its wings. Listen to stories of clear blue skies, joyful adventures, strange lands, and epic battles. And when you embark on your journey with your friends, keep these tales close to your heart. May they warn you of the mistakes of others, may they shield you from any danger that finds you. May they guide you towards the promised pastures green.

No dragons were harmed in the preparation of this talk, nor does it contain statements that could be construed libelous in any relevant jurisdiction.

Machine Learning Track
Machine Learning Track
18:00
18:00
30min
Dashing through the snow (or Sharing your data), in a Quarto Dashboard
Mine Cetinkaya-Rundel

Quarto Dashboards make it easy to create interactive dashboards using Python, R, Julia, and Observable:

You can publish a group of related data visualizations as a dashboard, using a wide variety of components including Plotly, Leaflet, Jupyter Widgets, htmlwidgets; static graphics (Matplotlib, Seaborn, ggplot2, etc.); tabular data; value boxes; and text annotations. Row- and column-based layouts are also flexible and easy to specify. The components are intelligently re-sized to fill the browser and adapted for display on mobile devices. Finally, you can author using any notebook editor (JupyterLab, etc.) or in plain-text markdown with any text editor (VS Code, RStudio, Neovim, etc.).
Dashboards can be deployed as static web pages (no special server required) or you can optionally integrate a backend Shiny Server for enhanced interactivity.

General Track
General Track
18:00
30min
Kùzu: A Graph Database Management System for Python Graph Data Science
Guodong Jin

This talk presents Kùzu, a new open-source graph database management system (GDBMS) designed for the Python graph data science (GDS) ecosystem. Kùzu's embedded architecture makes it very easy to import as a library without a server setup and also provides performance advantages. Specifically, users can: (i) ingest and model their application records in various raw file formats as a graph; (ii) query and transform these graphs using the Cypher query language; and (iii) export graphs into popular Python GDS packages with no copy cost. We will live-demo Kùzu's integration with NetworkX and PyTorch Geometric.
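
The embedded workflow looks roughly like this, following Kùzu's Python API (the database path and schema are illustrative):

    import kuzu

    db = kuzu.Database("./kuzu_demo")  # embedded: no server to run
    conn = kuzu.Connection(db)

    conn.execute("CREATE NODE TABLE Person(name STRING, PRIMARY KEY (name))")
    conn.execute("CREATE (:Person {name: 'Alice'})")

    result = conn.execute("MATCH (p:Person) RETURN p.name")
    while result.has_next():
        print(result.get_next())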

Data Track
Data Track
18:00
240min
Matplotlib Sprint
Kyle Sunden

Join sprint at https://numfocus-org.zoom.us/j/88901164458?pwd=44hL3o0IAavVVfHeUBNwCp4Ykcc7Zc.1

Sprint Leader
Kyle Sunden (@ksunden on GitHub)

Sprint
Sprints
18:00
30min
Modeling Extreme Events with PyMC
Jorn Mossel

Extreme events are ubiquitous, ranging from temperature records to stock market crashes or network outages. Using extreme weather events as an example, we show how they can be modeled in a Bayesian way using PyMC. We start with simple models and ultimately move on to a more advanced model by implementing a Gaussian Process Latent Variable Model, which allows us to perform spatial modeling of extreme events.
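
As a starting point for the kind of block-maxima modeling described here, a Gumbel likelihood over synthetic yearly maxima can be written in PyMC as follows (a minimal sketch; the talk builds up to far richer spatial models):

    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(42)
    annual_max_temp = rng.gumbel(loc=35.0, scale=2.0, size=50)  # synthetic yearly maxima

    with pm.Model():
        mu = pm.Normal("mu", mu=30.0, sigma=10.0)
        beta = pm.HalfNormal("beta", sigma=5.0)
        pm.Gumbel("obs", mu=mu, beta=beta, observed=annual_max_temp)
        idata = pm.sample(1000, tune=1000)  # posterior over the extreme-value parameters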

Machine Learning Track
Machine Learning Track
18:30
18:30
120min
Introduction to Machine Learning Pipelines: How to Prevent Data Leakage and Build Efficient Workflows
Cainã Max Couto da Silva

This webinar will introduce machine learning pipelines and discuss their importance in building efficient and robust workflows. It will explain how pipelines help to prevent data leakage and ensure model stability by allowing for proper separation of training, validation, and test data. Through a blend of theory and practice, it will provide and explain code chunks in Python using well-known open-source packages like scikit-learn (pipeline and column transformers) and feature-engine to ensure a complete understanding of the .fit(), .transform(), and .predict() methods. By the end of this webinar, the audience will have a solid understanding of the theory behind machine learning pipelines and practical examples of using them effectively in their projects.
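
The leakage-prevention pattern at the heart of the webinar, in miniature (toy data; the column names are illustrative):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = pd.DataFrame({"age": [25, 32, 47, 51, 38, 29],
                      "city": ["NYC", "LA", "NYC", "SF", "LA", "SF"]})
    y = [0, 1, 0, 1, 1, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    pre = ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])
    pipe = Pipeline([("pre", pre), ("clf", LogisticRegression())])

    pipe.fit(X_train, y_train)  # scaler and encoder are fit on training data only...
    pipe.predict(X_test)        # ...so nothing from the test set leaks into them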

Machine Learning Track
Machine Learning Track
18:30
90min
Keras (3) for the Curious and Creative
Ngesa Marvin

This session is designed for those who are curious about Keras and want to learn more about its capabilities for computer vision and stable diffusion. We will start with a refresher on the core deep learning concepts that are essential for understanding Keras. Then, we will dive into a quick introduction to Keras 3 with JAX, using object detection as an example. Next, we will explore how to use KerasCV and Keras 3 together for multi-framework modeling. We will also discuss how to use pre-trained PyTorch models with Keras 3. Finally, we will wrap up with a discussion of stable diffusion: what it is, and how to implement it using Keras 3 and multi-framework modeling.
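
Switching Keras 3 onto the JAX backend is a one-line environment setting (the tiny model below is a placeholder for the session's vision models):

    import os
    os.environ["KERAS_BACKEND"] = "jax"  # must be set before keras is imported

    import keras

    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")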

General Track
General Track
18:30
30min
We rewrote tsfresh in Polars and why you should too
Chris Lo, Mathieu Cayssol

tsfresh is a popular time-series feature extraction library with over 7500 stars and thousands of downloads per day. Its features describe key characteristics of a time series using algorithms from statistics, econometrics, signal processing, and non-linear dynamics. tsfresh, however, is over 6 years old and suffers from slow performance and an outdated API.

That's why we open-sourced functime: a new high-performance time-series machine-learning library. What makes functime special is that it's written from the ground up with Polars, currently the world's fastest dataframe library, built on Apache Arrow and Rust.

functime recently rewrote hundreds of features from tsfresh into Polars. The result? Up to 50x improvement in speed and memory efficiency compared to existing Pandas / NumPy implementations. functime is now the world's fastest time-series feature extraction library. Moreover, functime effortlessly parallelizes work for thousands of time series using Polars' highly optimized Rayon backend. No distributed cluster (e.g. Spark) needed!

This talk begins with a brief introduction to time-series feature extraction and its use cases. We then dive deep into the reasons why Polars is an optimal query engine for time-series feature engineering. We discuss the challenges and learnings from our rewrite. In particular, we will demonstrate, through code and benchmarks, lesser-known Polars tips and tricks to squeeze 10x speedups out of your data engineering workflows.
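
For a taste of the Polars style of feature extraction (generic expressions for illustration, not functime's actual API):

    import polars as pl

    df = pl.DataFrame({
        "series_id": ["a", "a", "a", "b", "b", "b"],
        "value": [1.0, 2.0, 4.0, 10.0, 9.0, 11.0],
    })

    features = df.group_by("series_id").agg(
        pl.col("value").mean().alias("mean"),
        pl.col("value").std().alias("std"),
        (pl.col("value") - pl.col("value").shift(1)).abs().mean().alias("mean_abs_change"),
    )
    print(features)  # one row of features per series, computed in parallel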

Data Track
Data Track
19:00
19:00
150min
Data Track
19:00
30min
Building Learning to Rank models for search using Large Language Models
Sujit Pal

The presentation describes a case study where Large Language Models were used to generate query-document relevance judgements. These judgements were then used to train Learning to Rank models that rerank search results from an untuned engine, resulting in an almost 20% increase in precision.

LLM Track
LLM Track
19:30
19:30
90min
Using Large Language Models to improve your Search Engine
Nidhin Pattaniyil, Ravi, Mustafa Zengin

Everywhere you look, everyone is talking about Large Language Models (LLMs).
Are you feeling a bit overwhelmed and looking for a simple intro and a guided application of LLMs?

Many internet companies have a search engine.
In this tutorial, we will cover practical use cases of LLMs in improving a search engine, such as:

1) Understanding user intent in query
2) Checking if query is relevant to a document
3) Fine-tuning LLMs with a custom corpus.
4) Updating the search engine documents with LLM knowledge.

This tutorial is meant to be beginner friendly and will focus on practical use cases.
No prior experience with search or advanced machine learning is needed.
Google Colab and an e-commerce dataset will be provided.

LLM Track
LLM Track
20:00
20:00
30min
Hands-On Network Science
Colleen Farrelly, Franck Kalala Mutombo, Yae U. Gaba

In this talk, we will introduce network science and demonstrate its usefulness in mining different types of data, including social network data, time series data, and spatiotemporal data. Our talk will include practical, hands-on examples of real-world problems we've solved in the developing world with tools from network science, including epidemic forecasting, stock market crash prediction, and food pricing trend analysis across regions. Python code will be available for those who want to run the analysis themselves.
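
For readers new to the tooling, the NetworkX basics underlying such analyses look like this (the karate-club graph is a stock example dataset):

    import networkx as nx

    G = nx.karate_club_graph()                 # classic toy social network
    centrality = nx.betweenness_centrality(G)  # which nodes bridge communities?
    top = sorted(centrality, key=centrality.get, reverse=True)[:3]
    print("Most central nodes:", top)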

General Track
General Track
20:30
20:30
60min
Machine Learning Track
20:30
30min
NonlinearSolve.jl: how compiler smarts can help improve the performance of numerical methods
Chris Rackauckas

Many problems can be reduced down to solving f(x)=0, maybe even more than you think! Solving a stiff differential equation? Finding out where the ball hits the ground? Solving an inverse problem to find the parameters to fit a model? In this talk we'll showcase how SciML's NonlinearSolve.jl is a general system for solving nonlinear equations and demonstrate its ability to efficiently handle these kinds of problems with high stability and performance. We will focus on how compilers are being integrated into the numerical stack so that many of the things that were manual before, such as defining sparsity patterns, Jacobians, and adjoints, are all automated out-of-the-box, making it greatly outperform purely numerical codes like SciPy or NLsolve.jl.

General Track
General Track
21:00
21:00
30min
General Track
21:00
30min
Orchestrating Generative AI Workflows to Deliver Business Value
hugo bowne-anderson

This talk explores a framework for how data scientists can deliver value with Generative AI: How can you embed LLMs and foundation models into your pre-existing software stack? How can you do so using Open Source Python? What changes about the production machine learning stack and what remains the same?

LLM Track
LLM Track