PyData Global 2023

Introduction to Machine Learning Pipelines: How to Prevent Data Leakage and Build Efficient Workflows
12-08, 18:30–20:30 (UTC), Machine Learning Track

This workshop will introduce machine learning pipelines and discuss why they matter for building efficient and robust workflows. It will explain how pipelines help prevent data leakage and keep models stable by enforcing a proper separation of training, validation, and test data. Blending theory and practice, it will walk through Python code examples using well-known open-source packages such as scikit-learn (Pipeline and ColumnTransformer) and feature-engine, so that attendees fully understand the .fit(), .transform(), and .predict() methods. By the end of the workshop, the audience will have a solid grasp of the theory behind machine learning pipelines and practical examples of how to use them effectively in their own projects.


Objective:
By the end of this session, participants will understand how machine learning pipelines work and will know how to use them to prevent data leakage and build efficient, robust ML models with tools from today's data science ecosystem, with a focus on scikit-learn.

Introduction:

This workshop will begin with an overview of the machine learning (ML) project lifecycle, emphasizing the critical role of model validation. Particular attention will be given to defining data leakage: its causes, consequences, and prevention strategies.

Hands-on learning:

Participants will work through practical exercises on data preprocessing and model training. These activities are designed to show how data leakage occurs, how to identify it, and how to prevent it.
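
As a rough sketch of how such an exercise might look (the synthetic data, the choice of SelectKBest, and the variable names are assumptions for illustration, not the workshop's actual material), the snippet below selects features on pure-noise data in two ways. When selection sees the full dataset before cross-validation, the score looks better than chance even though there is no signal to learn; when selection is refit inside each fold, the score should stay close to 0.5.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: the features carry no information about the labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5000))
y = np.array([0, 1] * 50)

# Leaky: feature selection is fitted on ALL rows, including the rows
# that will later serve as validation folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Leak-free: selection lives inside the pipeline, so it is refitted
# on the training part of every fold only.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20),
    LogisticRegression(max_iter=1000),
)
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection before CV (leaky): {leaky_score:.2f}")
print(f"selection inside pipeline:   {honest_score:.2f}")
```

An unexpectedly high validation score on data that should not be predictable is one practical symptom of leakage.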

The workshop will compare and contrast methods to avert data leakage, both with and without the use of pipelines. Participants will gain insights into the additional steps required when not using pipelines and understand the benefits of implementing the scikit-learn pipeline in their workflows.
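
The comparison can be sketched as follows, assuming a synthetic dataset and a simple StandardScaler plus LogisticRegression model (both chosen here for illustration only). Without a pipeline, each preprocessing step has to be fitted on the training data and then applied to the test data by hand; with a pipeline, the same sequence is handled by a single .fit() and .predict() call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Without a pipeline: every step is managed manually.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_train)
manual_pred = clf.predict(X_test_scaled)

# With a pipeline: the same steps, applied in the same order.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)         # fits the scaler, then the classifier
pipeline_pred = pipe.predict(X_test)  # transforms, then predicts

print("same predictions:", np.array_equal(manual_pred, pipeline_pred))
```

Both routes are leak-free here; the pipeline mainly removes the chance of accidentally calling fit_transform on the test set or forgetting a step at prediction time.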

We will explore how essential libraries like scikit-learn, feature-engine, and imbalanced-learn integrate into this process. The workshop will provide practical examples using these tools, from basic to relatively complex pipeline implementations.
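
A somewhat more complex pipeline might look like the sketch below, which uses only scikit-learn and pandas. The toy DataFrame, its column names ("age", "income", "city"), and the chosen transformers are hypothetical and serve only to show how a ColumnTransformer routes numeric and categorical columns through different preprocessing before a model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in one numeric column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
df.loc[rng.choice(n, 20, replace=False), "income"] = np.nan
y = (df["age"] + rng.normal(0, 5, n) > 40).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0, stratify=y
)

# Numeric columns: impute with the training median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)  # all statistics come from X_train only
preds = model.predict(X_test)
n_features = model.named_steps["preprocess"].transform(X_test).shape[1]
print("features after preprocessing:", n_features)
print("test accuracy:", model.score(X_test, y_test))
```

Because the imputer, scaler, and encoder are all steps of one estimator, the whole object can be passed to cross-validation or hyperparameter search without any step seeing held-out rows.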

Towards the end, the workshop will look at how AutoML libraries like PyCaret employ pipelines. We will also briefly look at how the pipeline convention is applied in other libraries, such as Spark ML, to highlight how widespread and important this approach is across ML ecosystems.


Prior Knowledge Expected

Previous knowledge expected

Hi, my name is Cainã,
I'm the father of a human, a dog, and a cat.
I love traveling and getting together with family and friends.

Professionally speaking, as a data scientist with a PhD in bioinformatics and over ten years of project experience, I have developed a strong foundation in data science and analytics. I have spent the last few years at world-renowned companies, developing end-to-end machine learning applications. Driven by my passion for knowledge, I have also taught specialized courses on various data science topics. I am always eager to apply my expertise and create meaningful impact.