12-08, 16:00–17:30 (UTC), Machine Learning Track
This tutorial shows how to train machine learning models for time-to-event prediction tasks (health care, predictive maintenance, marketing, insurance...) without introducing bias from censored training (and evaluation) data.
Main tutorial notebook:
- https://vincent-maladiere.github.io/survival-analysis-demo/lab/index.html?path=tutorial_part_1.ipynb
According to Wikipedia:
Survival analysis is a branch of statistics for analyzing the expected duration of time until one event occurs, such as deaths in biological organisms and failure in mechanical systems. [...]. Survival analysis attempts to answer certain questions, such as what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?
In this tutorial, we will dive deep into a practical case study of predictive maintenance using tools from the scientific Python ecosystem. Here is a tentative agenda:
- What time-censored data is and why it is a problem when training time-to-event regression models.
- Single event survival analysis with Kaplan-Meier using scikit-survival.
- Evaluation of the calibration of survival analysis estimators using the integrated Brier score (IBS) metric.
- Predictive survival analysis modeling with Cox Proportional Hazards and Survival Forests using scikit-survival, and with GradientBoostedIBS implemented from scratch with scikit-learn.
- How to use a trained GradientBoostedIBS model to estimate the median survival time and the probability of survival at a fixed time horizon.
- Inspecting the learned statistical association between input features and survival probabilities using a partial dependence plot.
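To make the first agenda items concrete, here is a minimal sketch of a Kaplan-Meier estimator written in plain Python. It is not taken from the tutorial notebooks, which rely on scikit-survival. The toy durations, the function names (`kaplan_meier`, `survival_at`, `median_survival_time`) and the horizon of 6 are illustrative assumptions. It contrasts a naive estimate that treats censoring times as event times with an estimate that accounts for censoring.

```python
from itertools import groupby


def kaplan_meier(durations, events):
    """Return (times, survival_probs) at each distinct observed event time.

    durations: observed time per individual (event time or censoring time).
    events: True if the event was observed, False if the record is censored.
    """
    pairs = sorted(zip(durations, events))
    n_at_risk = len(pairs)
    times, surv = [], []
    s = 1.0
    for t, group in groupby(pairs, key=lambda p: p[0]):
        group = list(group)
        n_events = sum(1 for _, observed in group if observed)
        if n_events > 0:
            # Multiply by the conditional probability of surviving past t.
            s *= 1.0 - n_events / n_at_risk
            times.append(t)
            surv.append(s)
        # Events and censored records both leave the risk set after t.
        n_at_risk -= len(group)
    return times, surv


def survival_at(times, surv, horizon):
    """Step-function lookup of the estimated P(T > horizon)."""
    prob = 1.0
    for t, s in zip(times, surv):
        if t > horizon:
            break
        prob = s
    return prob


def median_survival_time(times, surv):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in zip(times, surv):
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5, e.g. under heavy censoring


# Hypothetical toy data: 8 machines, 4 observed failures, 4 censored records.
durations = [2, 3, 3, 5, 6, 8, 9, 12]
events = [True, False, True, True, False, True, False, False]

times, surv = kaplan_meier(durations, events)

# Naive estimate: pretend every censoring time is a failure time.
naive_survival_at_6 = sum(d > 6 for d in durations) / len(durations)

print("naive P(T > 6):", naive_survival_at_6)
print("Kaplan-Meier P(T > 6):", round(survival_at(times, surv, 6), 3))
print("median survival time:", median_survival_time(times, surv))
```

On this toy data, the naive estimate of surviving past time 6 is 0.375, whereas the Kaplan-Meier estimate is about 0.6: ignoring censoring makes survival look worse than the data supports.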
The tutorial notebooks also contain additional material that we probably won't have time to present in 90 minutes, namely:
- Competing risks modeling with the Nelson-Aalen and Aalen-Johansen estimators using lifelines.
- Estimation of the cause-specific cumulative incidence function (CIF) using our GradientBoostedIBS model.
- Extracting implicit failure data from operation logs using sessionization with Ibis and DuckDB.
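For the competing risks material, the following is a minimal plain-Python sketch of the nonparametric Aalen-Johansen estimate of the cause-specific cumulative incidence function. It is not the lifelines implementation used in the notebooks. The function name `cumulative_incidence`, the convention that cause `0` means "censored", and the toy data are assumptions made for illustration.

```python
from itertools import groupby


def cumulative_incidence(durations, causes, cause_of_interest):
    """Aalen-Johansen estimate of the CIF for one cause.

    causes: 0 marks a censored record (assumed convention),
            k >= 1 is the cause of the observed event.
    Returns (times, cif_values) at each distinct time with any event.
    """
    pairs = sorted(zip(durations, causes))
    n_at_risk = len(pairs)
    surv_before = 1.0  # any-event survival just before the current time
    cif = 0.0
    times, values = [], []
    for t, group in groupby(pairs, key=lambda p: p[0]):
        group = list(group)
        d_any = sum(1 for _, c in group if c != 0)
        d_k = sum(1 for _, c in group if c == cause_of_interest)
        if d_any > 0:
            # P(event-free until t) * P(event of cause k at t | at risk).
            cif += surv_before * d_k / n_at_risk
            surv_before *= 1.0 - d_any / n_at_risk
            times.append(t)
            values.append(cif)
        n_at_risk -= len(group)
    return times, values


# Hypothetical toy data: two failure causes (1 and 2), 0 means censored.
durations = [1, 2, 3, 4, 5, 6]
causes = [1, 2, 0, 1, 2, 0]

times_1, cif_1 = cumulative_incidence(durations, causes, cause_of_interest=1)
times_2, cif_2 = cumulative_incidence(durations, causes, cause_of_interest=2)

print("CIF for cause 1:", [round(v, 3) for v in cif_1])
print("CIF for cause 2:", [round(v, 3) for v in cif_2])
```

At any time, the cumulative incidences of all causes plus the any-event survival probability sum to 1, which is a quick sanity check for an estimator of this kind.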
Target audience: good familiarity with machine learning concepts, with prior experience using scikit-learn (you know what cross-validation means and how to fit a Random Forest on a Pandas dataframe).
Previous knowledge expected
Machine Learning software engineer at Inria and member of the maintainers' team of the scikit-learn open source project.