PyData Global 2023
We all know ChatGPT is smart, but is it smart enough to choose a function from a query? We will explore this through OpenAI function calling, letting the model select the right function for a given query, and walk through the extracted JSON output with a demo.
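As a flavour of the mechanism covered, here is a minimal sketch of OpenAI function calling; the get_weather schema is a hypothetical example, and the exact SDK surface may differ between versions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# a hypothetical function schema the model can choose to call
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide whether (and what) to call
)

# assuming the model decided to call a function
call = response.choices[0].message.tool_calls[0]
print(call.function.name)       # which function the model chose
print(call.function.arguments)  # the extracted JSON arguments
```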
If you're using Jupyter Notebooks in your workflow, this session is for you. Learn about practical tips, workflows, and MLOps tools that help teams and individuals scale their work, better utilize Jupyter Notebooks, and successfully bring projects from research to production.
Data compression is not a one-codec-fits-all problem. It necessarily involves a trade-off between compression ratio and speed. A higher compression ratio usually results in a slower compression process. Depending on the needs, one may want to prioritize one over the other. The issue is that finding the optimal compression parameters can be a slow process due to the large number of combinations of compression parameters (codec, compression level, filter, split mode, number of threads, etc.), and it may require a significant amount of manual trial and error to find the best combinations.
Btune (https://btune.blosc.org) is a dynamic plugin for Blosc2 that can help find the optimal combination of compression parameters for datasets compressed with Blosc2 (https://github.com/Blosc/c-blosc2, https://github.com/Blosc/python-blosc2), while significantly speeding up this process.
The modern mobile phone is an incredibly powerful computing device. However, mobile platforms have historically excluded the Python data science community, requiring specialist platform-specific skills, or making the use of Python data science tools exceedingly difficult.
This isn't true anymore. In this talk, you'll learn how to build and run an app on your phone that uses the Python data analysis and visualization tools you're already familiar with.
When it comes to open source contributions, design is often an afterthought. There is a plethora of innovative open source software made with little or no contribution from experienced designers, which often leads to inconsistent interfaces, confusing interactions, and ultimately, a poor user experience. When paired together, strong open source projects and human-centered, empathetic design thinking can create software that users can actually use. This session will explore the opportunities for design in open source projects and how developers can exercise a few design practices to influence the adoption and usability of their projects. It will examine what experience design is, why it matters, and the principles behind effective design. Attendees will learn through hands-on activities how to incorporate design thinking strategies into their projects without sacrificing design and how doing so will result in a better product for their users.
As data scientists, you understand the power of data, but the true value lies in enabling others to explore and comprehend insights. Join us on a journey where data becomes more than just numbers – a dynamic, interactive story that anyone can engage with.
Streamlit, known for its user-friendly approach to data app development, is now enhanced with the integration of ipyvizzu. This innovative open-source data visualization tool places a strong emphasis on animation and storytelling. This combination empowers data scientists to craft and deploy immersive, animated reports and dashboards swiftly.
Imagine the impact of creating Streamlit apps that allow business stakeholders without data expertise to independently analyze complex datasets, generate custom animated charts, and construct interactive data narratives. It's a game-changer for data-driven decision-making.
Before finalizing a machine learning model, data scientists conduct dozens, if not hundreds, of experiments. To keep track of these experiments, they employ setups of varying complexity, including physical notebooks, spreadsheets, or even complex configurations using various libraries and dedicated infrastructure. In this practical presentation, I will demonstrate how you and your team can start tracking experiments right away using a very simple setup, with most of the ingredients you are probably already using.
In many domains, machine learning methods predict the future demand for some physical good or virtual service that comes with finite capacity. Those predictions are then typically used to plan an appropriate level of supply. Often, it is not possible to directly measure (and train on) the actual demand, but only the fraction of it that could be fulfilled under the given constraints in the past, such as finite stocks or limited capacity. That is, one predicts a different quantity than one measures. This talk explores the various surprising aspects of the demand-sales distinction that can arise in data science projects. We explore the paradoxes and the most dramatic problems that one encounters and find out how to avoid them. This talk will sharpen your thinking when dealing with such intricate settings, and allow you to create and utilize demand forecasts in the best possible way.
Many Python frameworks are suitable for creating basic dashboards, but struggle with more complex ones. Though many teams default to splitting into separate frontend and backend divisions when faced with increasing dashboard complexity, this approach introduces its own set of challenges, like reduced personnel interchangeability and cumbersome refactoring due to REST API changes.
Solara, our new web framework, addresses these challenges. We build on the foundational principles of ReactJS, yet maintain the ease of writing only Python. Solara has a declarative API, designed for dynamic and complex UIs, yet easy to write. Reactive variables power our state management and automatically trigger rerenders. Our component-centric architecture encourages code reusability, and hot reloading promotes efficient workflows. Together with our rich set of UI and data-focused components, Solara spans the entire spectrum from rapid prototyping to robust, complex dashboards.
Without modification your application and components will work in Jupyter, Voilà and on our standalone server for high scalability. Our server can run along existing FastAPI, Starlette, Flask and even Django servers to integrate with existing web services. We prioritize code quality and developer friendliness by including strong typing and first class support for unit and integration testing.
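To give a sense of the component model, here is a minimal counter sketch using a reactive variable (a standard introductory example, not taken from the talk itself):

```python
import solara

count = solara.reactive(0)  # reactive state: changing it rerenders dependents

@solara.component
def Page():
    solara.Button(
        f"Clicked {count.value} times",
        on_click=lambda: count.set(count.value + 1),
    )
```

The same component runs unchanged in Jupyter, Voilà, or on the standalone Solara server.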
How can you make sure that your recommender systems work as expected? Once you put them into production and users start interacting with the model predictions, evaluating the model output quality might become tricky. In this talk, we will explore how to monitor the quality of recommender systems in production, detect data drift, and prevent known model failure modes.
Join the sprint at https://numfocus-org.zoom.us/j/81665667614?pwd=1x2JKibaHybUAztHkQ64bOSo5fCrCX.1
Philipp Rudiger (https://github.com/philippjfr)
Simon Hansen (https://github.com/Hoxbro)
Andrew Huang (https://github.com/ahuang11)
"You should never, ever deal with time zones if you can help it" Tom Scott
Instead, you should let your software deal with time zones for you.
Polars is a Dataframe library with full support for time zones - come and learn how to leverage it to its full potential!
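As a small taste of what the talk covers, a sketch of time-zone handling in Polars (column and zone names are illustrative):

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame({"ts": [datetime(2023, 12, 6, 12, 0)]})
df = df.with_columns(
    pl.col("ts")
    .dt.replace_time_zone("UTC")                # declare the naive timestamps as UTC
    .dt.convert_time_zone("Europe/Amsterdam")   # convert to a target zone
    .alias("ts_local")
)
print(df)
```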
sktime is a widely used scikit-learn compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the PyData/NumFOCUS stack.
This tutorial gives an updated introduction to sktime and presents a vignette slideshow with the most important features added since PyData Global 2022.
N-dimensional datasets are pervasive in many scientific areas, and getting quick slices of them is critical for an improved exploration experience. Blosc2 is a compression and format library that recently gained support for dealing with such multidimensional datasets. Crucially, by leveraging compression, Blosc2 can deal with sparse datasets effectively, as the zeroed parts are almost suppressed, whereas the non-zero parts can still be stored in smaller sizes than their non-compressed counterparts. In addition, the new double data partition inside Blosc2 minimizes the decompression of unnecessary data and provides top-class slicing speed.
The Map of Open-Source Science (MOSS) is a researcher’s entry point into the world of OSS for science that helps to improve discoverability of existing open-source scientific software tools and the communities around them.
In this talk, we will discuss how to detect chains of fraudulent transactions and help investigation agencies fight money laundering by providing useful insights, with the help of the Python programming language and its packages.
Data pipelines are essential for transforming, validating, and loading data from various sources into a target database or data warehouse. However, building and testing data pipelines can be challenging when the real data is not available, either due to privacy issues, technical limitations, or simply because the data is not yet collected. How can we ensure that our data pipelines are robust and reliable without having access to the actual data?
In this talk, we will share our experience of creating synthetic data to test data pipelines using Python. We will demonstrate how we used some statistical methods and Python packages such as Faker to generate realistic synthetic data for different use cases, such as customer profiles, transactions, and time series. We will also show how we used Flyway to load the synthetic data into a Postgres database and perform repeatable deployments. We will discuss the benefits and challenges of using synthetic data for testing data pipelines, as well as some best practices and tips for creating and using synthetic data effectively.
Cloud UX kinda sucks. It was written for cloud engineers who like very explicit systems, and always read the docs. This makes it a bad fit for data people (data scientists, data engineers, machine learning researchers) who rapidly learn and use several tools on a day-to-day basis. This mismatch in UX expectations results in poor utilization and wasted resources.
This talk goes through the challenges we faced when building a cloud UX for data people, and the kinds of solutions we ended up adopting when supporting Dask (parallel Python) in a cloud environment.
The talk will discuss how the Data & AI department at Jahez adopts innovative, advanced NLP techniques on search queries in the app to identify commercial opportunities by onboarding new restaurants that are highly demanded by Jahez's customers.
Data scientists strive to bridge the gap between raw data and actionable insights. Yet, the actual value of data lies in its accessibility to non-data experts who can unlock its potential independently. Join us in this hands-on tutorial hosted by experts from Vizzu and Streamlit to discover how to transform data analysis into a dynamic, interactive experience.
Streamlit, celebrated for its user-friendly data app development platform, has recently integrated with Vizzu's ipyvizzu - an innovative open-source data visualization tool that emphasizes animation and storytelling. This collaboration empowers you to craft and share interactive, animated reports and dashboards that transcend traditional static presentations.
To maximize our learning time, please come prepared by following the setup steps listed at the end of the tutorial description, allowing us to focus solely on skill-building and progress.
Large Language Models (LLMs) are revolutionizing how users search for, interact with, and generate new content. Recent stacks and toolkits around Retrieval-Augmented Generation (RAG) and agents enable users to build applications such as chatbots using LLMs on their private data. In this talk, we give a comprehensive survey of both basic and advanced RAG techniques. We show you what RAG is, how to set up a simple version, and how to evaluate and optimize RAG systems. We then cover advanced concepts (agents, fine-tuning), and help you think about how to build a full-stack LLM app.
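As one concrete illustration of a simple RAG setup, here is a sketch using LlamaIndex's 2023-era API (just one possible toolkit; the folder path is hypothetical and an OpenAI API key is assumed in the environment):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# index every document in a local folder of private files
documents = SimpleDirectoryReader("./my_private_docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# retrieval-augmented query: relevant chunks are retrieved first,
# then passed to the LLM as context before it answers
query_engine = index.as_query_engine()
print(query_engine.query("What does our refund policy say?"))
```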
The fastest runners are much faster than we expect from a Gaussian distribution, and the best chess players are much better. In almost every field of human endeavor, there are outliers who stand out even among the most talented people in the world. Where do they come from?
In this talk, I present two data-generating processes that yield lognormal distributions as possible explanations, and show that these models describe many real-world scenarios in the natural and social sciences, engineering, and business. I also suggest methods, using SciPy tools, for identifying these distributions, estimating their parameters, and generating predictions.
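For instance, a minimal SciPy sketch of fitting a lognormal distribution and predicting an extreme value, on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=0.75, size=10_000)  # synthetic performance data

# fit a lognormal with the location fixed at zero, then predict extremes
shape, loc, scale = stats.lognorm.fit(data, floc=0)
print(stats.lognorm.ppf(0.999, shape, loc=loc, scale=scale))  # a 1-in-1000 outlier
```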
Everybody knows Polars revolutionised the dataframe landscape, yet fewer realise that machine learning is next. Thanks to its extreme speed, we can speed up feature engineering by 1-2 orders of magnitude. The true gains, however, span across the whole ML lifecycle, with significantly faster batch inference and effortless scaling (no PySpark required!).
Add a best-in-class set of tools for feature extraction, model evaluation and diagnostic visualisations and you get functime: a next-generation library for ML forecasting. Though time-series practitioners are the primary audience, there's something for all data scientists. It's not just forecasting: it's about building the next generation of machine learning libraries.
In this session we will demonstrate how to measure and improve the quality of open data using the open source Python library Great Expectations. Attendees will learn quality testing techniques and methodologies to prepare high-quality longitudinal datasets using Open Data from cities and regional portals.
The pandas library for data manipulation and data analysis is the most widely used open source data science software library. Dask is the natural extension for scaling pandas workloads beyond a single machine. The continuing integration and adoption of Apache Arrow alleviates historical bottlenecks in both libraries.
Shiny for Python is a relatively new web application framework which uses transparent reactivity to build scalable web applications without code complexity. Shiny doesn't require you to write callbacks, but instead infers the relationships between components to minimally rerender them. This talk goes through the details of reactive programming to show why Shiny works, and how it can save you time and trouble.
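A minimal sketch of what this looks like in practice (a standard slider example, assuming the classic Shiny for Python API):

```python
from shiny import App, render, ui

app_ui = ui.page_fluid(
    ui.input_slider("n", "N", min=1, max=100, value=20),
    ui.output_text("result"),
)

def server(input, output, session):
    @output
    @render.text
    def result():
        # re-runs automatically whenever input.n() changes: no callbacks
        return f"n * 2 = {input.n() * 2}"

app = App(app_ui, server)
```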
People are hard to understand, developers doubly so! In this tutorial, we will explore how communities form in organizations to develop a better solution than "The Org Chart". We will walk through using a few key Python libraries in the space, develop a toolkit for Clustering Attributed Graphs (more on that later) and build out an extensible interactive dashboard application that promises to take your legacy HR reporting structure to the next level.
Have you ever started a seemingly straightforward project which you assumed would "only take a little bit", to find yourself hours later with all the tabs open, closer to the gnarliness of the truth, but still far away from a simple answer? Are you curious why you can't find the data you need, if open source is so open? We've all been there, including teams with literally decades of professional data and analysis experience. In this talk, our team from the Google Open Source Programs office will share stories and hard-learned lessons from our work researching and analyzing data to more deeply understand open source.
Almost all animals communicate with sound, but as far as we know only humans speak languages. How did speech evolve? How do animals like birds, bats, and dolphins learn their songs, and is it similar to how we learn to speak? Questions like these are answered by the study of acoustic communication. This talk will get you acquainted with this exciting research. Along the way you'll hear many different animal sounds, and find out how researchers in this area are using neural network models. You'll learn why there is a need for a core package for researchers in this area (think AstroPy for astronomy). We will present a package we've developed to meet that need, VocalPy, and give a demo of the features. Then we'll present some results we've obtained with VocalPy on evaluating methods for segmenting audio into sequences of animal sounds. Finally we'll share our development roadmap, and tell you how you can get involved with the VocalPy community.
Get to know the basics of API development without having a software development background. As every data analyst/scientist, you will inevitably have to deal with APIs, either for downloading data or to expose your model for others to use. In this talk, I will show you how easy it is to build your own API using FastAPI.
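As a flavour of how little code this takes, a minimal FastAPI sketch (the endpoint and toy "model" are hypothetical):

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/predict")
def predict(x: float):
    # stand-in for a real model: any callable mapping features to a prediction
    return {"prediction": 2 * x + 1}

# run locally with: uvicorn main:app --reload
# then try: http://127.0.0.1:8000/predict?x=3.5
```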
Given a test data point similar to the training points, we should expect the prediction of a machine learning model to be accurate.
However, we don't have the same guarantee for a prediction on a test point very far away from the training data, and many models offer no quantification of this uncertainty in their predictions.
These models, including the increasingly popular neural networks, produce a single number as the prediction for a test point of interest, making it difficult to quantify how much the user should trust this prediction.
Gaussian processes (GPs) address this concern; as its prediction for a given test point, a GP outputs not a single number but a probability distribution representing the range into which the value we're predicting is likely to fall.
By looking at the mean of this distribution, we obtain the most likely predicted value; by inspecting the variance of the distribution, we can quantify how uncertain we are about this prediction.
This ability to produce well-calibrated uncertainty quantification gives GPs an edge in high-stakes machine learning use cases such as oil drilling, drug discovery, and product recommendation.
While GPs are widely used in academic research in Bayesian inference and active learning tasks, many ML practitioners still shy away from them, believing that they need a highly technical background to understand and use GPs.
This talk aims to dispel that belief and offers a friendly introduction to GPs, including their fundamentals, how to implement them in Python, and common practices.
Data scientists and ML practitioners who are interested in uncertainty quantification and probabilistic ML will benefit from this talk.
While most background knowledge necessary to follow the talk will be covered, the audience should be familiar with common concepts in ML such as training data, predictive models, multivariate normal distributions, etc.
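As a minimal illustration of the uncertainty behaviour described above, a sketch using scikit-learn's GP implementation (toy one-dimensional data):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X_train = np.array([[1.0], [3.0], [5.0]])
y_train = np.sin(X_train).ravel()

gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(X_train, y_train)

X_test = np.array([[2.0], [20.0]])  # one point near the data, one far away
mean, std = gp.predict(X_test, return_std=True)
print(std)  # small near the training data, large far from it
```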
Intake is a Python library for describing, cataloging, finding and loading data. It has had the ethos of "load and get out of the way", which limited scope but provided a lot of convenience. However, complexity built up over the years, creating a barrier for new users to start with Intake. In this talk, I will present Intake 2, a complete rewrite of the package, featuring a much simpler reader interface and the removal of many complex and unused features. This overhaul also enabled the development of a general-purpose data pipelining description, making Intake both simpler and much more powerful.
Join sprint at https://numfocus-org.zoom.us/j/81746276652?pwd=bIh9dapxLFXutcSztKa5IYwloGMIr8.1
Sprint Leaders
Christian Luhmann
Purna Chandra Mansingh
Jesse Grabowski
Stream processing is hard! It's expensive! It's unnecessary! Batch is all you need! It's hard to maintain! While some of these may sound true, the world of streaming data has come a long way and it is time we start to take advantage of data in real-time.
This talk dips your toes into the world of streaming data and demystifies some of the common misconceptions. We will cover some of the basics of streaming data and how you can get started on your first stream processing project with the open source Python stream processor Bytewax.
The talk takes inspiration from a famous literary piece, Dante Alighieri's "Inferno" (Italian for "Hell"), to offer data scientists a moral revenge on the data sinners they constantly encounter in their professional life. While Dante populates his Hell with political enemies and even former Popes, I redraw the map of Dante's Inferno, finding a place and an adequate punishment for data sinners. With the help of the audience, I will make sure that creators of invalid CSV files, users of identifiers so unique that they are even longer than the recommended PEP 8 line length, and all other data sinners find their well-deserved place in Hell. The bottom line of the talk is that data scientists' life will not improve until organisations begin to manage their data properly and realise that data products and infrastructures can be developed only when data satisfy minimal usability criteria, such as machine-readability.
Daft (www.getdaft.io) is an open-source distributed DataFrame library, written in Rust but with a Python API. It features blazing fast cloud storage I/O with its Rust I/O layer, all accessible via a familiar Python DataFrame interface. Load tens of thousands of CSV and Parquet files in seconds, all from the comfort of Python!
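A sketch of what this looks like (bucket path and column names are hypothetical; credentials come from your environment):

```python
import daft

# glob over many Parquet files directly on cloud storage
df = daft.read_parquet("s3://my-bucket/listings/**/*.parquet")
df = df.where(df["price"] > 100).select("item", "price")
df.show()
```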
What models do you need to run an on-demand logistics operation? Whether you’re building apps for delivery, mobility, or ecommerce, these three decision models can get you started: forecasting, scheduling, and routing. In this talk, we’ll build, test, and deploy each model using Python and Google OR-Tools in a DecisionOps workflow. This talk is for data scientists and decision algorithm developers.
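To give a flavour of the modelling style, here is a toy OR-Tools CP-SAT optimization sketch (the variables and objective are illustrative, not one of the talk's actual models):

```python
from ortools.sat.python import cp_model

model = cp_model.CpModel()
x = model.NewIntVar(0, 10, "x")   # e.g., couriers assigned to zone A
y = model.NewIntVar(0, 10, "y")   # e.g., couriers assigned to zone B
model.Add(x + y <= 10)            # shared capacity constraint
model.Maximize(x + 2 * y)         # zone B deliveries are worth more

solver = cp_model.CpSolver()
if solver.Solve(model) == cp_model.OPTIMAL:
    print(solver.Value(x), solver.Value(y))
```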
This talk concerns data analysis techniques applied to the Pokémon Trading Card game. From statistical odds of key card draws to insights from 100 matches and a dashboard, I'll show how data, code, and ChatGPT improve my card game strategies.
With LanceDB you can make your laptop more powerful than any distributed vector database for semantic search. LanceDB is an open-source embedded vector database. It's lightweight like SQLite but powerful enough to deliver real-time semantic search over a billion vectors on a laptop.
LanceDB is backed by the Lance columnar format, which delivers up to 100x performance improvement over Parquet for managing multimodal AI data (e.g., vectors, images, point clouds, and more). With it, Lance gives AI teams a high-performance single source of truth across the whole AI life-cycle, from analytics to training to debugging.
In this talk we'll cover the use cases for production inference and in the data lake. We'll talk about the technical details of the Lance columnar format and what makes it different. And we'll show a demonstration of LanceDB for multi-modal semantic search.
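A minimal sketch of the embedded workflow (toy two-dimensional vectors for illustration):

```python
import lancedb

db = lancedb.connect("./my_lancedb")  # embedded: a local directory, no server
table = db.create_table(
    "items",
    data=[
        {"vector": [0.1, 0.2], "caption": "red bicycle"},
        {"vector": [0.9, 0.8], "caption": "blue car"},
    ],
)
# nearest-neighbour search over the stored vectors
print(table.search([0.1, 0.25]).limit(1).to_pandas())
```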
As app developers, we’re accustomed to bringing the data in our data store directly to the custom APIs and UX/UI we build for our apps. But what if instead you could build an application in the same environment where the data lives? With Snowflake’s Native App Framework, you can build apps that run within Snowflake – right next to the data – using Python and Snowflake primitives. You can even monetize your app and drive revenue by distributing your app on the Snowflake Marketplace. In this session, Gilberto Hernandez, Lead Developer Advocate at Snowflake, will walk you step-by-step through building and deploying your first Snowflake Native App within Snowflake. To follow along in this lab, you’ll need:
- A Snowflake account (create a free trial account at signup.snowflake.com – be sure to select AWS as the underlying cloud provider)
- A code editor
Versatility. / ˌvɜr səˈtɪl ɪ ti / noun: ability to adapt or be adapted to many different functions or activities.
Often our ecosystems limit us to one technology stack/framework/solution that we end up working on day-to-day. Maybe because the framework was chosen for us, maybe it's the one available at hand, maybe that's the skill most prevalent in the team, maybe it was chosen by following a decision analysis process, maybe other vagaries of the workplace were in play.
This is incredibly limiting in developing an intuition for problem solving, exploring the possibilities and simply being able to use the right tool for the right job.
In trying to gain experience on a new framework on our own, we are inundated with myriad concepts, jargon and "technical evangelism" so much that getting to the practical stuff often becomes an uphill battle for most of us.
This workshop aims to address this fundamental issue:
1. Get hands-on experience across some of the most in-demand data engineering frameworks around today: Pandas, Spark, Dask, Polars, etc.
2. Focus on one core thing, data munging: shaping data, analyzing it, and deriving insights.
In this interactive 2-hour workshop, fellow data engineers will explore and gain practical experience with some of the industry's most sought-after data engineering frameworks. Through a series of engaging exercises and real-world-like examples, fellow attendees will be empowered to tackle data engineering challenges efficiently and effectively.
Discover how Python empowers the implementation of Modern Portfolio Theory (MPT) for constructing efficient investment portfolios. Explore risk assessment, asset allocation optimization, and the construction of high-return portfolios through practical applications.
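As a taste of the approach, a sketch of maximum-Sharpe portfolio optimization with NumPy and SciPy on synthetic returns (all figures are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=(1000, 4))  # synthetic daily returns, 4 assets
mu, cov = returns.mean(axis=0), np.cov(returns.T)

def neg_sharpe(w):
    # negative Sharpe ratio (risk-free rate omitted for simplicity)
    return -(w @ mu) / np.sqrt(w @ cov @ w)

n = len(mu)
res = minimize(
    neg_sharpe,
    x0=np.full(n, 1 / n),
    bounds=[(0, 1)] * n,                                         # long-only
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],  # fully invested
)
print(res.x)  # optimal portfolio weights
```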
Reproducibility is a cornerstone of science. However, most data science projects and notebooks struggle at the most basic level of declaring dependencies correctly. A recent study showed that 42% of the notebooks executed failed due to missing dependencies.
FawltyDeps is a dependency checker that finds imports you forgot to declare (undeclared dependencies), and packages you declared, but that are not imported in your code (unused dependencies).
This talk will guide you through integrating FawltyDeps in your manual or automated workflows and how this can improve the reproducibility of your notebooks and projects.
As the number of production machine learning use cases increases, we find ourselves facing new and bigger challenges where more is at stake. Because of this, it's critical to identify the key areas to focus our efforts on, so we can ensure our machine learning pipelines are reliable and scalable. In this talk we dive into the state of production machine learning, covering the concepts that make it so challenging, as well as some of the recommended tools available to tackle these challenges.
Join us for an insightful session dedicated to enhancing developer productivity through effective code snippet management in JupyterLab.
Topics covered include:
Unlocking the full potential of Pieces for seamless organization and retrieval of code snippets.
Crafting efficient and reusable code snippets.
Utilizing code snippet libraries to expedite development cycles.
Bridging the gap between code and documentation in JupyterLab.
Tips on how to generate code specific to your project based on Copilot's on-device language model.
Reddit r/place was conceived as Reddit's 2017 April Fools' tongue-in-cheek experiment. A shared white canvas of a million pixels (1000 x 1000) appeared in a subreddit called "place". Redditors could change the color of a single pixel of their choosing. Once a Redditor changed a pixel, they were blocked by the system for a random interval (5-20 minutes), effectively preventing any single Redditor from having a significant influence on the canvas. The experiment, described by Newsweek as the Internet's best experiment yet, attracted 16.1M pixel changes performed by 1.2M unique users over 72 hours. While the expected result was total chaos, verging on white noise, the final state of the canvas contained an intricate collage of complex logos and artwork. In this talk, I present the experiment in detail, the data that were collected during the r/place experiment, and the research opportunities associated with this natural experiment. I introduce three research studies that make use of this unique dataset and setting. I share the machine-learning models we built as well as the insights gained using explainability tools, all using Python.
The choice between Pandas, Polars, and DuckDB can influence outcomes like productivity, integration, and velocity. This tutorial offers an introduction to three Python libraries: Pandas 2, Polars, and DuckDB. Attendees will have an opportunity not only to understand the functionalities of these libraries but also to engage in hands-on experimentation.
When operating a classifier in a production setting (i.e. the predictive phase), practitioners are interested in two potentially different outputs: a "hard" decision used to drive a business decision and/or a "soft" decision providing a confidence score linked to each potential decision (e.g. usually related to class probabilities).
Scikit-learn does not provide any flexibility to go from "soft" to "hard" predictions: it uses a cut-off point at a confidence score of 0.5 (or 0 when using decision_function) to get class labels. However, optimizing a classifier to get a confidence score close to the true probabilities (i.e. a calibrated classifier) does not guarantee accurate "hard" predictions using this heuristic. Conversely, training a classifier for optimum "hard" prediction accuracy (with the cut-off constraint at 0.5) does not guarantee obtaining a calibrated classifier.
In this talk, we will present a new scikit-learn meta-estimator allowing us to get the best of both worlds: a calibrated classifier providing optimum "hard" predictions. This meta-estimator will land in a future version of scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/26120.
We will provide some insights regarding how to obtain accurate probabilities and predictions, and also illustrate how to use this model in practice on different use cases: cost-sensitive problems and imbalanced classification problems.
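The meta-estimator from that pull request later shipped in scikit-learn 1.5 as TunedThresholdClassifierCV; a minimal sketch assuming that released API:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(weights=[0.9, 0.1], random_state=0)  # imbalanced toy data

# tune the decision cut-off for a business-relevant metric, via cross-validation
tuned = TunedThresholdClassifierCV(LogisticRegression(), scoring="balanced_accuracy")
tuned.fit(X, y)
print(tuned.best_threshold_)  # the learned cut-off, often far from 0.5
```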
In the rapidly evolving landscape of AI and machine learning, the deployment and serving of models have become as crucial as their development. Xinference, a state-of-the-art library, emerges as a game-changer in this domain, offering seamless model serving capabilities. This talk aims to delve deep into how Xinference not only simplifies the process of deploying language, speech recognition, and multimodal models but also intelligently manages hardware resources. By choosing an appropriate inference runtime based on the hardware and allocating models to devices according to their usage, Xinference ensures optimal performance and resource utilization.
The Julia programming language has proven to be a solution to the two-language problem, especially in the area of scientific computing. However, being both easy and fast is just the "syntactic" feature and benefit. An extension or superset of Julia can unleash its "semantic" potential to provide value to every company going through digital transformation. We will discuss this in more detail with examples in the context of quantitative trading and hedge funds. We will also mention Julia's potential in combination with technologies such as blockchain. We will release a new package as the first step towards an extension or superset of Julia for building decentralized systems.
Explore the labyrinth of hidden technical debt in ML systems through the lens of a data scientist. Delve into six core challenges, illustrated by a churn prediction model case, and discover Python's prowess in navigating these challenges. Uncover Python tools like Docker, Flyte, Airflow, and Git that arm you against technical debt, leading to resilient ML infrastructure.
In the realm of machine learning, the complexity of data pipelines often hinders rapid experimentation and iteration. This talk will introduce DDataflow, an innovative open-source tool, designed to facilitate end-to-end testing in ML pipelines by leveraging decentralized data sampling. Attendees will gain insights into the challenges of unit testing in large-scale data pipelines, the design philosophy behind DDataflow, and practical implementation strategies to enhance the reliability and efficiency of their ML pipelines.
Did you know that 87% of data science projects never make it into production? While open source libraries like scikit-learn and TensorFlow have gone a long way towards democratizing data science, they are also unintentionally limited by the assumptions and research focus of academia at the time they were released. One such assumption is that a model must be trained on batches of data and that all machine learning models need more data in order to perform well. This introduces a gap between training and inference, as there is a requirement to accumulate enough instances for training. For real-time use cases such as anomaly detection, models can become stale even before they get deployed to production.
Fortunately there has been a trend towards building machine learning models that are geared towards learning from streams of data and that can react immediately to changes in data. This form of learning is usually referred to as real-time machine learning, online learning, or incremental learning.
In this talk, we will compare the two approaches to machine learning, provide a brief overview of River, a library for building online learning models, and demo a real-time application using PyEnsign, a real-time data streaming client.
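As a taste of the online-learning style, a minimal River sketch using its test-then-train loop on a bundled dataset:

```python
from river import compose, datasets, linear_model, metrics, preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression(),
)
metric = metrics.ROCAUC()

for x, y in datasets.Phishing():      # a stream of (features-dict, label) pairs
    y_pred = model.predict_proba_one(x)
    metric.update(y, y_pred)          # evaluate before training: test-then-train
    model.learn_one(x, y)             # update the model one sample at a time

print(metric)
```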
Pandas 2 brings new Arrow data types, faster calculations and better scalability. Dask scales Pandas across cores and recently released a new "expressions" optimization for faster computations. Polars is a new competitor to Pandas designed around Arrow with native multicore support. Which should you choose for modern research workflows? We'll solve a "just about fits in ram" data task using the 3 solutions, talking about the pros and cons so you can make the best choice for your research workflow. You'll leave with a clear idea of whether Pandas 2, Dask or Polars is the tool to invest in and how Polars fits into the existing numpy-focused ecosystem.
Do you still need 5x working RAM for Pandas operations (probably not!)? Can Pandas string operations actually be fast (sure)? Since Polars uses Arrow data structures, can we easily use tools like Scikit-learn and matplotlib (yes-maybe)? What limits do we still face? Could you switch to experimenting with Polars and if so, what gains and issues might you face?
Federated learning, a transformative technique, not only overcomes data limitations and privacy challenges but also enhances the trustworthiness of machine learning. By moving computation to data sources, it ensures privacy while enabling collaborative model training on vastly more data than before. This keynote introduces federated learning, demonstrates how Python developers can implement it in under 20 lines of code using the Flower framework (https://flower.dev), and provides an outlook on how federated learning will shape the next generation of machine learning systems.
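To give a flavour of how compact a Flower client can be, here is a toy sketch (the NumPy "weights" stand in for a real model, and a Flower server is assumed to be running):

```python
import numpy as np
import flwr as fl

weights = [np.zeros((3, 3))]  # stand-in for real model parameters

class DummyClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return weights

    def fit(self, parameters, config):
        # in a real client: load parameters, train locally, return the update
        return parameters, 1, {}

    def evaluate(self, parameters, config):
        return 0.0, 1, {"accuracy": 1.0}

# connects to a Flower server, e.g. one started with fl.server.start_server(...)
fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=DummyClient())
```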
While most scientists aren't at the scale of black hole imaging research teams that analyze Petabytes of data every day, you can easily fall into a situation where your laptop doesn't have quite enough power to do the analytics you need.
In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets with real-world examples on actual powerful machines on a cloud provided by the presenter – starting from how the data is stored and read, to how it is processed and visualized.
High performance computing has been a key tool for computational researchers for decades. More recently, cloud economics and the intense demand for running AI workloads has led to a convergence of older, established standards like MPI and a desire to run them on modern cloud frameworks like Kubernetes. In this tutorial, we will discuss the historical arc of massively parallel computation, focusing on how modern cloud frameworks like Kubernetes can both serve data scientists looking to build production-grade applications and run HPC-style jobs like MPI programs and distributed AI training. Moreover, we will show practical examples of submitting these jobs in a few lines of Python code.
Artificial Intelligence is all the rage, largely thanks to generative systems like ChatGPT, Midjourney, and the like. These commercial systems are very sophisticated and powerful, but also a bit opaque if you want to learn how they work or adapt them to your needs. What happens inside the 'black box'?
Luckily there are open AI models that you can download comfortably, study without restrictions, and adjust so that they do what you want. This requires some technical knowledge, but thanks to Hugging Face's models and their ecosystem of Python libraries, delving into AI is easier than ever.
You will soon find yourself combining different models, performing different tasks, and creating complex systems. But this complexity can grow very quickly, and soon you'll find yourself with spaghetti code if you are not careful. By using the Kedro catalog and Kedro pipelines, you will be able to organize the code in no time.
With LLM hype growing ever greater, almost every company is racing to create their LLM application, whether it's an internal tool to boost productivity, or a chat interface for their product.
However, if your product or domain isn't fully generic, you'll probably hit a lot of challenges that make deploying your LLM application a meaningful risk.
In this talk, I'll discuss the main challenges in customizing and evaluating LLMs for specific domains and applications, and suggest a few workflows and tools to help solve those challenges.
With the recent explosion of large language and vision models, it has become inherently very costly to train models on new data. Coupled with that, the various new data privacy laws, introduced or soon to be introduced, make the "right to be forgotten" very costly and time-consuming. In this talk, we will go through the current state of research on "machine unlearning" (how a learnt model forgets something without retraining) and give a general demonstration of a machine unlearning framework.
You’re processing a large amount of data with Python, and your code is too slow.
One obvious way to get faster results is adding multithreading or multiprocessing, so you can use multiple CPU cores.
Unfortunately, switching straight to parallelism is almost always premature, often unnecessary, and sometimes impossible.
We'll cover the different goals for performance, why parallelism only achieves one of them, the costs of parallelism, and the alternative: speeding up your code first.
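As a preview of the "speed up your code first" argument, a sketch contrasting a pure-Python loop with its vectorized equivalent:

```python
import numpy as np

values = np.random.default_rng(0).random(10_000_000)

# slow: a pure-Python loop over ten million floats
total = 0.0
for v in values:
    total += v * v

# fast: the same sum of squares, vectorized in a single call;
# often a 10-100x speedup before any parallelism is involved
total = float(np.dot(values, values))
```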
Join sprint at https://numfocus-org.zoom.us/j/86009093429?pwd=lz9sX0Cwu6gbCz5fGdqYwvQQi1RKhF.1
Sprint Leaders
Christian Luhmann
Purna Chandra Mansingh
Jesse Grabowski
When training models on large datasets, one of the biggest challenges is low GPU utilization. These powerful processors are often underutilized due to inefficient I/O and slow data loading. This mismatch between computation and storage leads to wasted GPU resources, low performance, and high cloud storage costs. The rise of generative AI and GPU scarcity is only making this problem worse.
In this session, Lu will discuss strategies for maximizing GPU utilization by using the open-source stack of PyTorch+Alluxio+S3.
Climate change projections and analyses are one of many processes that require not only sound scientific approaches, but also scalable and efficient algorithms due to the data-intensive nature of climate science.
Xclim is a cutting-edge climate analysis library built using xarray and dask to solve real problems in climate change analysis and processing, offering tools such as climate model ensemble selection and bias adjustment, climate data health check-ups, in addition to the ability to calculate more than 150 relevant climate indicators over enormous databases.
Developed with user-friendliness in mind, Xclim serves as the backbone of Environment and Climate Change Canada's ClimateData.ca platform.
Join us to explore Xclim's capabilities and follow a typical workflow, transforming vast climate datasets into actionable climate insights.
As data practitioners, we often rely on the data engineering teams upstream to deliver the right data needed to train ML models at scale. Deploying these ML models as a data application to downstream business users is constrained by one's web development experience. Using Snowpark, you can build end-to-end data pipelines and data applications from scratch using Python.
We live in a real-time world, where information and consumer preferences can change multiple times per day. This requires machine learning algorithms that can be trained and updated frequently and cost-effectively. This talk will demonstrate how data scientists can use new frameworks to develop ML models that can be easily updated with new data, without requiring retraining on the full dataset.
Large Language Models are pretty cool, but we need to be aware of how they can be compromised.
I will show how neural networks are vulnerable to attacks through an example of an adversarial attack on deep learning models in Natural Language Processing(NLP).
We’ll explore the mechanisms used to attack models, and you’ll get a new way to think about the security of deep learning models.
An understanding of deep learning is required.
You probably don’t need a fancy new tool to take advantage of LLMs. While the explosion of inventive AI applications feels like a massive leap forward, the core challenges in plugging them into the business represent an incremental step from the discipline of MLOps.
The challenges are largely equivalent. Retrieval augmented generation is effectively a recommendation system. Agents are the control flow of your program. Chains of LLM calls are simple DAGs. And you're still stuck trying to monitor quantitatively unclear predictions, wrestle expensive, unstable APIs into submission, and build out and manage complex dataflows.
The toolbox, as well, remains similar. In this talk we present the library Hamilton, an open source microframework for expressing dataflows in Python. We show how it can help you build observable, stable, context-independent pipelines that span the gamut from classical ML to LLMs/RAG, enabling you to maintain sanity and keep up with the pace of change as everyone steps into the fascinating new world of AI.
Over the past years, the compute landscape has become much more fragmented and heterogeneous: GenAI needs access to various types of GPUs, sometimes leveraging vertical scalability, sometimes horizontal. The demand for CPU-based compute has grown more diverse as well, as vertically scaling, high-performance data engines like Arrow and DuckDB have reduced the need for inefficient approaches based on horizontal scaling. On top of this, the competition among clouds and specialized compute providers is getting more intense, motivated by customer demands for cost-efficiency.
Since its inception, Metaflow, which was originally open-sourced by Netflix in 2019, has been built to address diverse compute needs. Instead of proposing a new universal compute paradigm like Spark, which requires bespoke libraries, Metaflow integrates with various compute substrates and providers, including all major clouds. Recently, Metaflow gained support for large-scale distributed workloads, including distributed training on large GPU clusters.
In this talk, we give an overview of the changing landscape for compute and describe how open-source Metaflow allows Python developers to leverage various compute platforms easily.
One of the key questions in modern data science and machine learning, for businesses and practitioners alike, is how do you move machine learning projects from prototype and experiment to production as a repeatable process. In this tutorial, we present an introduction to the landscape of production-grade tools, techniques, and workflows that bridge the gap between laptop data science and production ML workflows. We’ll cover a wide range of applications, including business-critical ML and data pipelines of today, as well as state-of-the-art generative AI and LLM use cases of tomorrow.
Chatbots that understand context and respond based on past conversations are a "Dream Come True" with state-of-the-art Generative AI models. In this tutorial, I will demonstrate building a chatbot using the OpenAI API and LLMs available on Hugging Face. I will also talk about the advantages of using LangChain and the different strategies that can be used to configure your chatbot to yield the best responses. Not only that, the chatbot can also get you the relevant texts (basically the context) from which it derives its answers, for transparency, validation and troubleshooting. Python libraries like OpenAI, HuggingFace, LangChain and Streamlit will be used throughout the tutorial to build this GenAI-powered chatbot.
Training Large Language Models (LLMs) requires a vast amount of input data, and the higher the quality of that data the better the model will be at producing useful natural language. NVIDIA NeMo Data Curator is a toolkit built with RAPIDS and Dask for extracting, cleaning, filtering and deduplicating training data for LLMs.
In this session, we will zoom in on one element of LLM pretraining and explore how we can scale out fuzzy deduplication of many terabytes of documents. We can run a distributed Jaccard similarity workload by deploying a RAPIDS accelerated Dask cluster on Kubernetes to remove duplicate documents from our training set.
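As a sketch of the deployment step, spinning up a Dask cluster on Kubernetes (cluster name and container image are illustrative; a RAPIDS image provides GPU-capable workers):

```python
from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(name="dedup", image="rapidsai/rapidsai:latest", n_workers=8)
client = Client(cluster)

# ... submit the distributed Jaccard-similarity deduplication work here ...
```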
The landscape of Large Language Models (LLMs) has expanded rapidly, offering users a diverse range of options for text generation and analysis. However, the cost associated with utilizing these LLMs can turn out to be very expensive. During this presentation, I will delve into practical strategies aimed at achieving a delicate balance: reducing inference costs while simultaneously elevating model performance, enhancing quality, and optimizing latency. Additionally, I will discuss essential architectural principles for constructing LLM-based systems and products, alongside pragmatic methodologies to fine-tune open-source LLM models, enhancing their performance in specific use-cases. I will also explore some practical evaluation methods for benchmarking models against baseline standards, delve into embedding techniques for precise query classification, and unravel the intricacies of shot-prompting strategies to bolster adaptability to unfamiliar data.
Open source large language models (LLMs) are now inching towards matching the proficiency of proprietary models such as GPT-4. In addition, operating your own LLMs can unveil advantages in aspects like data privacy, model customizability, and cost efficiency. However, running your own LLMs and realizing these benefits in a production environment is not easy: it necessitates a precise set of optimizations and a robust infrastructure. Come to this talk to learn about the problems you might face when using your own large language models, and find out how OpenLLM can help you solve them.
The goal of this workshop is to address the gap between the development of technical work (whether that's via research or more traditional data science work) and its reproducibility, by providing attendees with the necessary knowledge to get started creating Python packages. This means that if you're a researcher (with basic Python knowledge) wanting to make your theories more accessible via code, or a data professional wanting to share your Python code inside or outside of your organization, this workshop will help you understand how to contribute to, and develop, open-source projects from scratch.
In 2023, with the introduction of Pandas 2, Apache Arrow became the dominant standard for both the in-memory representation and the over-the-wire transfer format for data in DataFrames.
In this talk, we will examine the performance benefits of using Apache Arrow end-to-end, from the data lake or warehouse to client-side DataFrames. We will demonstrate with Python examples how data can now be moved between Pandas 2, Polars, and DuckDB at no cost (zero-copy), and we will look at how Arrow enables the replacement of row-oriented APIs for data retrieval (JDBC/ODBC) with column-oriented protocols (Arrow Flight and ADBC). We will show how we built a query service that bridges the data lake with Python clients. DataFrame clients can read data using a network-hosted service that reads Arrow data from Parquet files, processes the data in Arrow format, and transfers the data to clients using an Arrow Flight service. We will also look toward a file-free future for DataFrames, where they can be easily stored and updated in a serverless platform.
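A sketch of the client-side hand-offs on a toy frame (the pandas-to-Polars hand-off is essentially zero-copy when the data is Arrow-backed):

```python
import duckdb
import pandas as pd
import polars as pl

# a pandas DataFrame backed by Arrow memory
df = pd.DataFrame({"x": [1, 2, 3]}).convert_dtypes(dtype_backend="pyarrow")

pl_df = pl.from_pandas(df)                              # hands over Arrow data
result = duckdb.sql("SELECT SUM(x) AS s FROM df").df()  # queries the frame in place
print(pl_df, result)
```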
This talk examines using open-source LLMs for real-world purposes. It compares the benefits and drawbacks of open-source LLMs to proprietary options like OpenAI. The discussion covers the economics of hosting open-source LLMs, highlights serving frameworks, explores cloud GPU availability, and gives an overview of key open-source LLMs.
Join the sprint at https://numfocus-org.zoom.us/j/88237670803?pwd=vSKWQ3FULy7ufuXQgWOK3OO0pyRhhC.1
Sprint Leader
Jeremy Ravenel (https://github.com/jravenel)
This presentation explores the challenges, such as cost, latency, and security, faced when developing a new Large Language Model (LLM) app and presents solutions to these obstacles. You will learn how to build your own AI-enabled real-time data pipeline without complex and fragmented typical LLM stacks such as vector databases, frameworks, or caches. We will leverage an open-source LLM App library in Python to implement real-time in-memory data indexing, directly reading data from any compatible storage, then processing, analyzing, and sending it to output streams.
As we descend from the peak of the hype cycle around Large Language Models (LLMs), chat-based document interrogation systems have emerged as a high value practical use case. The ability to ask natural language questions and get relevant answers from a large corpus of documents has the potential to fundamentally transform organizations and make institutional knowledge accessible.
Retrieval-augmented generation (RAG) is a technique to make foundational LLMs more powerful and accurate, and a leading way to implement a personal or company-level chat-based document interrogation system. In this talk, we'll understand RAG by creating a personal chat application. We'll use a new OSS project called Ragna that provides a friendly Python and REST API, designed for this particular case. We'll also demonstrate a web application that leverages the REST API, built with Panel, a powerful OSS Python application development framework.
By the end of this talk, you will have an understanding of the fundamental components that form a RAG model as well as exposure to open source tools that can help you or your organization explore and build on your own applications.
A shift is a poetic word for uncertainty. Winds shift, rivers and sands drift, and people change. Coming to the not-so-poetic world of data science, what about the data? Data comes from systems and the people using them, so it is natural that data will see the rigors of shift too. A model that was trained and tested for particular dynamics may account for the expected uncertainty in the data, such as a shift in user behavior. But what happens when the shift goes beyond expectations? How do teams detect the different types of data drift? More so, how do they tackle the detected drift? In this talk, I will gently introduce you to data drift and how the industry tackles this issue.
Learn about the different approaches for training large-scale machine learning models using PyTorch.
Pandas is loved and venerated for its flexibility and ease-of-use. However, its oft-quoted slowness has prompted many others like duckdb, polars, and RAPIDS cuDF to step in and offer faster alternatives. These are all fantastic tools, but they have non-zero adoption costs, more restrictive APIs compared to pandas, and they don’t always work with 3rd party libraries that use pandas today.
cudf.pandas takes a fresh approach: instead of trying to be a replacement for pandas, it effectively accelerates pandas on the GPU. cudf.pandas requires no code changes (not even your pandas imports!), supports 100% of the pandas API, and third-party libraries that use pandas are magically accelerated on the GPU.
If you use pandas today and want to run your code on the GPU with 0 changes today, this talk is for you. If you are the maintainer of a library that uses pandas and you’d like to support GPUs with 0 changes today, this talk is for you. If you’re a Pythonista at heart and enjoy hearing about the proxy pattern and deep import customization, this talk is for you!
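A sketch of the zero-change workflow (the DataFrame is a toy example):

```python
# In a notebook, load the accelerator before importing pandas:
#   %load_ext cudf.pandas
# Or run an unmodified script from the command line:
#   python -m cudf.pandas my_script.py

import pandas as pd  # same import as always, now GPU-accelerated where supported

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df.groupby("a").sum())  # runs on the GPU, falling back to CPU when needed
```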
In this talk, we will cover practical tools for modern machine learning: datasets, models, and demos. First, we will talk about how to use the Hugging Face Hub, covering how to easily find the right models and datasets for your machine learning tasks. Then, we will walk through building and sharing ML demos, covering how to quickly demo ML models for class presentations, portfolios, etc., using the Gradio (www.gradio.dev) library.
The Python packaging ecosystem has a massive and diverse user community with various needs. A subset of this user base, the data science and scientific computing communities, i.e., PyData communities, have historically relied on the conda package and environment management tool for their workflows. conda has robust solutions for packaging and distributing libraries and managing dependencies in environments, but there are still unsolved challenges for reliably reproducing runtime environments. For instance, compute-intensive R&D activities require certain reproducibility guarantees for collaborative development and to ensure production-level tools' stability and integrity. Many teams lack proper documentation and dependable practices for installing and regenerating the same runtime conditions across their software pipelines and systems, leading to product instability and release and production delays.
In this talk, we will:
* Share reproducibility best practices for Python-based data science workflows. For this, we will present real-world examples where reproducibility was not a core requirement or consideration of the project but was introduced as an afterthought.
* Demonstrate a greenfield solution to this problem: conda-store, an open source project that ensures flexible yet reproducible environments with features like version control, role-based access control, and background enforcement of best practices, all the while incorporating a user-friendly user interface.
You will learn about all the variables that affect runtime conditions (like enumerating project dependencies and technical details about your operating system and hardware). We will also present a checklist of automated tasks that should be part of a reproducible workflow and the different packaging solutions in the PyData ecosystem with a deeper focus on conda-store. We hope to share the perspective of a downstream user of the packaging ecosystem and bring attention to the conversations around runtime-environment reproducibility.
This tutorial will introduce how to train machine learning models for time-to-event prediction tasks (health care, predictive maintenance, marketing, insurance...) without introducing a bias from censored training (and evaluation) data.
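As a minimal illustration of handling censoring rather than discarding it, a sketch with the lifelines library (one common option, not necessarily the one used in the tutorial; the data is toy):

```python
import numpy as np
from lifelines import KaplanMeierFitter

durations = np.array([5, 6, 6, 2, 4, 4])       # observed time-to-event or censoring time
event_observed = np.array([1, 0, 0, 1, 1, 1])  # 1 = event happened, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)
print(kmf.survival_function_)  # accounts for censored samples instead of dropping them
```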
Historically, it has been difficult to reuse existing batch processing code in streaming applications. Because of this, ML engineers had to maintain two implementations of their jobs: one for streaming and one for batch.
In this talk we'll introduce beavers, a stream processing library optimized for analytics. It can be used to run both batch and streaming jobs with minimal code duplication, whilst still being good at both.
In today's data-driven world, knowing how to gather and analyze information is more critical than ever. Join us for a compact session on using Python and Scrapy to crawl the web and solve real-time problems. We'll cover the basics, and then dive into a practical example of collecting apartment data from the internet.
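As a preview of the practical example, a minimal Scrapy spider sketch (the URL and CSS selectors are hypothetical):

```python
import scrapy

class ApartmentSpider(scrapy.Spider):
    name = "apartments"
    start_urls = ["https://example.com/listings"]  # hypothetical listings page

    def parse(self, response):
        # extract one item per listing on the page
        for listing in response.css("div.listing"):
            yield {
                "title": listing.css("h2::text").get(),
                "price": listing.css("span.price::text").get(),
            }
        # follow pagination, if present
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```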
How many fish are in the ocean? To answer this efficiently, we attempt to modernize fisheries operations to support interoperable and scalable sonar data processing by building user-friendly customizable Prefect workflows. We share our story to inform others considering ways to provide modern orchestration tools to users without a lot of technical experience.
In the spirit of constructive chaos, this talk will cover data democratization: why it's important, what it means for organizations, and what's needed to make it happen.
Fully serverless systems are compelling for a number of reasons; they are inherently scalable, highly available and have a low maintenance burden. The challenge with a serverless system is providing sufficiently strong guarantees of data consistency without either sacrificing performance or simply shifting the burden of maintaining consistency to an external client-server system. At ArcticDB (https://github.com/man-group/arcticdb) we have spent years refining a fully serverless model that pushes the boundaries of what can be achieved with nothing but a python library and commodity object storage. In this talk we will share re-usable techniques for ensuring data reliability without external synchronization.
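For orientation, a minimal sketch of the client-only model (local LMDB storage here stands in for production object storage such as S3; the symbol and data are invented):

    import arcticdb as adb
    import pandas as pd

    # Connect straight to storage -- there is no database server in the picture
    ac = adb.Arctic("lmdb://arcticdb_demo")
    lib = ac.get_library("prices", create_if_missing=True)

    df = pd.DataFrame({"price": [1.0, 2.0]},
                      index=pd.date_range("2023-01-01", periods=2))
    lib.write("AAPL", df)            # a versioned write
    print(lib.read("AAPL").data)     # reads back the latest version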
Unlock robust statistical inference for time series data with tsbootstrap, a new open source Python library implementing specialized bootstrapping techniques.
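tsbootstrap packages many such techniques behind one API; to show the underlying idea rather than the library's exact interface, here is a hand-rolled moving block bootstrap:

    import numpy as np

    def moving_block_bootstrap(x, block_length, n_bootstraps, seed=None):
        # Resample overlapping blocks so short-range serial dependence is preserved
        rng = np.random.default_rng(seed)
        n = len(x)
        n_blocks = -(-n // block_length)  # ceiling division
        samples = []
        for _ in range(n_bootstraps):
            starts = rng.integers(0, n - block_length + 1, size=n_blocks)
            sample = np.concatenate([x[s:s + block_length] for s in starts])[:n]
            samples.append(sample)
        return np.array(samples)

    # A dependence-aware confidence interval for the mean of an autocorrelated series
    x = np.cumsum(np.random.default_rng(0).normal(size=200)) * 0.1
    means = moving_block_bootstrap(x, block_length=20, n_bootstraps=1000).mean(axis=1)
    print(np.percentile(means, [2.5, 97.5]))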
Did you know that the core Python syntax and semantics can be tailored for interactive computing use cases? It turns out that more is possible than you would expect! For example, at the most basic level, Jupyter supports simple syntax extensions like so-called "magic" commands. One can go much deeper, however. In this talk, I'll show that it's possible to augment and abuse Python to support a plethora of interactive use cases. I'll start with the simple example of building an optional chainer for Python (supporting syntax reminiscent of JavaScript, like a?.b()?.c). I'll then show how to use these same ideas to accelerate data science operations, concluding with an example of how to perform full dataflow tracking in order to give users the illusion of dataframe queries that run instantaneously.
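The talk goes well beyond what plain runtime tricks can do, but a tiny wrapper class conveys the flavor of optional chaining without any syntax changes (an illustration only, not the speaker's implementation):

    class Maybe:
        """Optional-chaining wrapper: a None anywhere short-circuits the chain."""

        def __init__(self, value):
            self._value = value

        def __getattr__(self, name):
            return Maybe(None if self._value is None else getattr(self._value, name))

        def __call__(self, *args, **kwargs):
            return Maybe(None if self._value is None else self._value(*args, **kwargs))

        def unwrap(self):
            return self._value

    print(Maybe("hello").upper().unwrap())  # "HELLO"
    print(Maybe(None).upper().unwrap())     # None, with no AttributeError raised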
Are you a member or leader of an open source community with open governance and a charitable mission?
Sit down and listen. Listen how to grow, nurture, and protect your community. Watch while it grows, takes off, and spreads its wings. Listen to stories of clear blue skies, joyful adventures, strange lands, and epic battles. And when you embark on your journey with your friends, keep these tales close to your heart. May they warn you of the mistakes of others, may they shield you from any danger that finds you. May they guide you towards the promised pastures green.
No dragons were harmed in the preparation of this talk, nor does it contain statements that could be construed as libelous in any relevant jurisdiction.
Quarto Dashboards make it easy to create interactive dashboards using Python, R, Julia, and Observable:
You can publish a group of related data visualizations as a dashboard, using a wide variety of components, including Plotly, Leaflet, Jupyter Widgets, and htmlwidgets; static graphics (Matplotlib, Seaborn, ggplot2, etc.); tabular data; value boxes; and text annotations. Row- and column-based layouts are flexible and easy to specify. Components are intelligently resized to fill the browser and adapted for display on mobile devices. Finally, you can author using any notebook editor (JupyterLab, etc.) or in plain text markdown with any text editor (VS Code, RStudio, Neovim, etc.).
Dashboards can be deployed as static web pages (no special server required) or you can optionally integrate a backend Shiny Server for enhanced interactivity.
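A dashboard source file can be as small as the hypothetical single-chart example below (the chart library and dataset are arbitrary choices); rendering it with quarto render produces a static page that needs no special server:

    ---
    title: "Iris Dashboard"
    format: dashboard
    ---

    ## Row

    ```{python}
    import plotly.express as px
    px.histogram(px.data.iris(), x="sepal_length", color="species")
    ```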
This talk presents Kùzu: a new open source graph database management system (GDBMS) designed for the Python graph data science (GDS) ecosystem. Kùzu's embedded architecture makes it very easy to import as a library without a server setup, and also provides performance advantages. Specifically, users can: (i) ingest and model their application records in various raw file formats as a graph; (ii) query and transform these graphs using the Cypher query language; and (iii) export graphs into popular Python GDS packages with no copy cost. We will live-demo Kùzu's integration with NetworkX and PyTorch Geometric.
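A minimal sketch of the embedded workflow, with a made-up schema and data (method names reflect Kùzu's Python API as we understand it, so treat the details as indicative):

    import kuzu

    db = kuzu.Database("./demo_db")   # embedded: just a directory, no server process
    conn = kuzu.Connection(db)

    conn.execute("CREATE NODE TABLE Person(name STRING, PRIMARY KEY (name))")
    conn.execute("CREATE REL TABLE Knows(FROM Person TO Person)")
    conn.execute("CREATE (:Person {name: 'Ada'})")
    conn.execute("CREATE (:Person {name: 'Grace'})")
    conn.execute("MATCH (a:Person), (b:Person) WHERE a.name = 'Ada' AND b.name = 'Grace' "
                 "CREATE (a)-[:Knows]->(b)")

    result = conn.execute("MATCH (a:Person)-[:Knows]->(b:Person) RETURN a.name, b.name")
    print(result.get_as_df())  # exports such as get_as_networkx() also exist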
Join the sprint at https://numfocus-org.zoom.us/j/88901164458?pwd=44hL3o0IAavVVfHeUBNwCp4Ykcc7Zc.1
Sprint Leader
Kyle Sunden (@ksunden on GitHub)
Extreme events are ubiquitous, ranging from temperature records to stock market crashes or network outages. Using extreme weather events as an example we show how they can be modeled in a Bayesian way using PyMC. We start with simple models and ultimately move on to a more advanced model by implementing a Gaussian Process Latent Variable Model, which allows us to perform spatial modeling of extreme events.
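As a flavor of the starting point (the data below are made up, and the talk's models go much further, ending in a Gaussian Process Latent Variable Model), a simple block-maxima model in PyMC might be:

    import numpy as np
    import pymc as pm

    # Hypothetical annual maximum temperatures (degrees C)
    maxima = np.array([31.2, 33.5, 30.9, 35.1, 32.4, 34.0, 36.2, 33.1])

    with pm.Model():
        mu = pm.Normal("mu", mu=30.0, sigma=10.0)    # location of the extremes
        beta = pm.HalfNormal("beta", sigma=5.0)      # scale of the extremes
        pm.Gumbel("obs", mu=mu, beta=beta, observed=maxima)
        idata = pm.sample()  # posterior over the extreme-value parameters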
This webinar will introduce machine learning pipelines and discuss their importance in building efficient and robust workflows. It will explain how pipelines help to prevent data leakage and ensure model stability by allowing for proper separation of training, validation, and test data. Through a blend of theory and practice, it will provide and explain code chunks in Python using well-known open-source packages like scikit-learn (pipeline and column transformers) and feature-engine to ensure a complete understanding of the .fit(), .transform(), and .predict() methods. By the end of this webinar, the audience will have a solid understanding of the theory behind machine learning pipelines and practical examples of using them effectively in their projects.
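For a concrete preview, a minimal pipeline with a column transformer (the toy data and column names are invented for illustration) looks like this:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40, 60, 80, 52],
                      "city": ["NY", "SF", "NY", "LA"]})
    y = [0, 1, 1, 0]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=0, stratify=y)

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])
    pipe = Pipeline([("prep", preprocess), ("model", LogisticRegression())])

    # The scaler's statistics are fit on the training fold only, so nothing leaks
    pipe.fit(X_train, y_train)
    print(pipe.predict(X_test))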
This session is designed for those who are curious about Keras and want to learn more about its capabilities for computer vision and stable diffusion. We will start with a refresher on the core deep learning concepts that are essential for understanding Keras. Then, we will dive into a quick introduction to Keras 3 with JAX, using object detection as an example. Next, we will explore how to use KerasCV and Keras 3 together for multi-framework modeling. We will also discuss how to use pre-trained PyTorch models with Keras 3. Finally, we will wrap up with a discussion of stable diffusion, what it is, and how to implement it using Keras 3 and multi-framework modeling.
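The multi-framework idea in one small sketch (a throwaway model on random data; the backend line is the part that matters):

    import os
    os.environ["KERAS_BACKEND"] = "jax"   # or "tensorflow" or "torch"; same code below

    import keras
    import numpy as np

    model = keras.Sequential([
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(np.random.rand(64, 20), np.random.randint(0, 10, size=64), epochs=1)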
tsfresh is a popular time-series feature extraction library with over 7500 stars and thousands of downloads per day. Its features describe key characteristics of a time-series using algorithms from statistics, econometrics, signal processing, and non-linear dynamics. tsfresh, however, is over 6 years old and suffers from slow performance and an outdated API.
That's why we open-sourced functime: a new high-performance time-series machine-learning library. What makes functime special is that it is written from the ground up with Polars, currently the world's fastest dataframe library, built on Apache Arrow and Rust.
functime recently rewrote hundreds of features from tsfresh in Polars. The result? Up to 50x improvement in speed and memory efficiency compared to existing Pandas / NumPy implementations. functime is now the world's fastest time-series feature extraction library. Moreover, functime effortlessly parallelizes work across thousands of time-series using Polars's highly optimized Rayon backend. No distributed cluster (e.g. Spark) needed!
This talk begins with a brief introduction to time-series feature extraction and its use cases. We then deep-dive into the reasons why Polars is an optimal query engine for time-series feature engineering. We discuss the challenges and learnings from our rewrite. In particular, we will demonstrate, through code and benchmarks, lesser-known Polars tips and tricks to squeeze 10x speedups out of your data engineering workflows.
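This is not functime's own API, but the Polars style it builds on looks roughly like this: many features for many series computed in one parallel pass (the feature set and data below are an arbitrary illustration):

    import polars as pl

    df = pl.DataFrame({
        "entity": ["a"] * 4 + ["b"] * 4,
        "value": [1.0, 2.0, 4.0, 3.0, 10.0, 9.0, 11.0, 12.0],
    })

    # One feature expression per column; Polars parallelizes across groups
    features = df.group_by("entity").agg(
        pl.col("value").mean().alias("mean"),
        pl.col("value").std().alias("std"),
        (pl.col("value") - pl.col("value").shift(1)).abs().mean().alias("mean_abs_change"),
    )
    print(features)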
The presentation describes a case study where Large Language Models were used to generate query-document relevance judgements. These judgements were then used to train Learning to Rank models, which reranked search results from an untuned engine, resulting in an almost 20% increase in precision.
Everywhere you look, everyone is talking about Large Language Models (LLMs).
Are you feeling a bit overwhelmed and looking for a simple introduction and a guided application of LLMs?
Many internet companies have a search engine.
In this tutorial, we will cover practical use cases of LLMs for improving a search engine, such as:
1) Understanding the user intent in a query
2) Checking whether a query is relevant to a document (see the sketch after this list)
3) Fine-tuning LLMs on a custom corpus
4) Updating the search engine's documents with LLM knowledge
This tutorial is meant to be beginner-friendly and will focus on practical use cases.
No prior experience with search or advanced machine learning is needed.
Google Colab and an e-commerce dataset will be provided.
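For item 2, a minimal relevance-judgement call might look like the following (the model name, prompt, and grading scale are illustrative choices, not the tutorial's prescribed setup; assumes an OPENAI_API_KEY is set):

    from openai import OpenAI

    client = OpenAI()

    def judge_relevance(query: str, document: str) -> str:
        prompt = (
            "On a scale of 0 (irrelevant) to 3 (perfectly relevant), rate how "
            "relevant the document is to the query. Answer with a single digit.\n"
            f"Query: {query}\nDocument: {document}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content.strip()

    print(judge_relevance("waterproof hiking boots",
                          "Leather boots with sealed seams, rated for heavy rain."))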
In this talk, we will introduce network science and demonstrate its usefulness in mining different types of data, including social network data, time series data, and spatiotemporal data. Our talk will include practical, hands-on examples of real-world problems we've solved in the developing world with tools from network science, including epidemic forecasting, stock market crash prediction, and food pricing trend analysis across regions. Python code will be available for those who want to run the analysis themselves.
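The hands-on parts rest on standard tooling; for instance, a toy contact network in NetworkX (edges invented for illustration) already hints at the epidemic-forecasting angle:

    import networkx as nx

    # A toy contact network; real edges might come from survey or mobility data
    G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

    # Centrality scores flag who is best placed to spread (or block) an outbreak
    print(nx.degree_centrality(G))
    print(nx.betweenness_centrality(G))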
Many problems can be reduced down to solving f(x)=0, maybe even more than you think! Solving a stiff differential equation? Finding out where the ball hits the ground? Solving an inverse problem to find the parameters to fit a model? In this talk we'll showcase how SciML's NonlinearSolve.jl is a general system for solving nonlinear equations and demonstrate its ability to efficiently handle these kinds of problems with high stability and performance. We will focus on how compilers are being integrated into the numerical stack so that many of the things that were manual before, such as defining sparsity patterns, Jacobians, and adjoints, are all automated out-of-the-box making it greatly outperform purely numerical codes like SciPy or NLsolve.jl.
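NonlinearSolve.jl itself is a Julia library; to stay in Python here, the f(x)=0 framing can be shown with the SciPy baseline the talk benchmarks against (the physical constants are an invented example):

    from scipy.optimize import brentq

    # "When does the ball hit the ground?" posed as height(t) = 0
    g, v0, h0 = 9.81, 12.0, 2.0

    def height(t):
        return h0 + v0 * t - 0.5 * g * t ** 2

    # The bracket [0.1, 10] must straddle the sign change
    t_impact = brentq(height, 0.1, 10.0)
    print(f"Ball lands after {t_impact:.3f} seconds")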
This talk explores a framework for how data scientists can deliver value with Generative AI: How can you embed LLMs and foundation models into your pre-existing software stack? How can you do so using open source Python? What changes about the production machine learning stack, and what remains the same?