12-08, 18:30–19:00 (UTC), Data Track
tsfresh is a popular time-series feature extraction library with over 7,500 stars and thousands of downloads per day. The features it extracts describe key characteristics of a time-series using algorithms from statistics, econometrics, signal processing, and non-linear dynamics. However, tsfresh is over 6 years old and suffers from slow performance and an outdated API.
That's why we open-sourced functime: a new high-performance time-series machine-learning library. What makes functime special is that it is written from the ground up with Polars, currently the world's fastest dataframe library, built on Apache Arrow and Rust.
functime recently rewrote hundreds of features from tsfresh in Polars. The result? Up to a 50x improvement in speed and memory efficiency compared to existing Pandas / NumPy implementations. functime is now the world's fastest time-series feature extraction library. Moreover, functime effortlessly parallelizes work across thousands of time-series using Polars' highly optimized Rayon backend. No distributed cluster (e.g. Spark) needed!
This talk begins with a brief introduction to time-series feature extraction and its use-cases. We then dive deep into the reasons why Polars is an optimal query engine for time-series feature engineering. We discuss the challenges and learnings from our rewrite. In particular, we will demonstrate, through code and benchmarks, lesser-known Polars tips and tricks to squeeze 10x speedups out of your data engineering workflows.
This talk is organised around answering the following questions:
- [0:00-2:00] What is time-series feature extraction, and what are its use-cases?
- [2:00-3:00] What made tsfresh so popular amongst time-series data scientists?
- [3:00-8:00] Why functime rewrote tsfresh in Polars
- [10:00-12:00] How much faster is functime compared to tsfresh? Benchmarks!
- [12:00-14:00] What challenges did we face?
- [14:00-24:00] What are the top 10 learnings about Polars and time-series data engineering from the rewrite?*
- [24:00-25:00] Why functime should be the new go-to library for time-series feature extraction
- [25:00-26:00] Why you should consider Polars for your next data engineering project!
- [26:00-30:00] Questions
*Some nuggets of learnings include: "Why dot product is underutilized for performance gains", "Lazy operations > eager", "Common subplan elimination is Polars' magic power", and more.
The talk assumes a basic understanding of DataFrame APIs, e.g. Pandas and Polars. It will be engineering-heavy, but attendees will hopefully come away as better data engineers. Lastly, we believe functime will be the first of many Polars-based machine learning libraries. This talk will serve as motivation and a guide for future Polars-backed libraries!
Link to repo: https://github.com/TracecatHQ/functime
Previous knowledge expected
Chris Lo is the co-founder of Tracecat (YC W24): an AI-native monitoring platform for cyber threat hunters and detection engineers.
Software engineer