PyData Global 2023

Francesc Alted

I am a curious person who studied Physics and Math when I was young. Through the years, I developed a passion for handling large datasets and using compression to enable their analysis using regular hardware that is accessible to everyone.

I lead the Blosc Development Team and am currently interested in determining, ahead of time, which combinations of codecs and filters can provide a personalized compression experience. This way, users can choose whether they prefer a higher compression ratio, faster compression speed, or a balance between the two.

Last but not least, I was recently awarded the "2023 Project Sustainability Award" from NumFOCUS.

You can learn more about what I am working on by reading my latest blog posts.

Sessions

12-06
11:30
90min
Btune: Making Compression Better
Francesc Alted

Data compression is not a one-codec-fits-all problem. It necessarily involves a trade-off between compression ratio and speed: a higher compression ratio usually comes at the cost of slower compression. Depending on their needs, users may want to prioritize one over the other. The issue is that finding the optimal compression parameters can be slow, because the number of possible combinations (codec, compression level, filter, split mode, number of threads, etc.) is large, and finding the best ones may require a significant amount of manual trial and error.

Btune (https://btune.blosc.org) is a dynamic plugin for Blosc2 that helps find the optimal combination of compression parameters for datasets compressed with Blosc2 (https://github.com/Blosc/c-blosc2, https://github.com/Blosc/python-blosc2), while significantly speeding up the search.
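To make the manual trial and error described above concrete, here is a minimal sketch of a brute-force parameter search. It is not Btune and does not use Blosc2: the codecs from the Python standard library (zlib, bz2, lzma) stand in for Blosc2 codecs, and the `candidates`, `benchmark`, and `pick` helpers, the sample data, and the `tradeoff` knob are all hypothetical names chosen for illustration.

```python
import bz2
import lzma
import time
import zlib


def candidates():
    """Yield (codec name, level, compress function) for a small parameter grid.

    Standard-library codecs are used as stand-ins for Blosc2 codecs.
    """
    for level in (1, 6, 9):
        yield "zlib", level, lambda d, l=level: zlib.compress(d, l)
        yield "bz2", level, lambda d, l=level: bz2.compress(d, l)
    for preset in (0, 6):
        yield "lzma", preset, lambda d, p=preset: lzma.compress(d, preset=p)


def benchmark(data):
    """Compress `data` with every candidate and record ratio and wall time."""
    results = []
    for name, level, compress in candidates():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results.append({
            "codec": name,
            "level": level,
            "ratio": len(data) / len(compressed),
            # Guard against a zero reading from a coarse timer.
            "seconds": max(elapsed, 1e-9),
        })
    return results


def pick(results, tradeoff):
    """Pick the best candidate for a given preference.

    tradeoff=0.0 favours speed only, tradeoff=1.0 favours ratio only,
    and values in between blend the two normalized scores.
    """
    best_ratio = max(r["ratio"] for r in results)
    fastest = min(r["seconds"] for r in results)

    def score(r):
        return (tradeoff * r["ratio"] / best_ratio
                + (1.0 - tradeoff) * fastest / r["seconds"])

    return max(results, key=score)


# Hypothetical sample data: a repetitive byte pattern of about 200 KB.
data = bytes(i % 251 for i in range(200_000))
results = benchmark(data)

for preference in (0.0, 0.5, 1.0):
    best = pick(results, preference)
    print(preference, best["codec"], best["level"], round(best["ratio"], 1))
```

Even this tiny grid requires compressing the data eight times. Adding filters, split modes, and thread counts multiplies the number of combinations, which is the cost that a tool predicting good parameters ahead of time would avoid.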

Data Track
12-06
13:30
30min
Blosc2: Fast And Flexible Handling Of N-Dimensional and Sparse Datasets
Francesc Alted

N-dimensional datasets are pervasive in many scientific areas, and getting quick slices of them is critical for a better exploration experience. Blosc2 is a compression and format library that recently gained support for such multidimensional datasets. Crucially, by leveraging compression, Blosc2 can handle sparse datasets effectively: the zeroed parts are almost entirely suppressed, while the non-zero parts can still be stored in less space than their uncompressed counterparts. In addition, the new double data partition inside Blosc2 minimizes the decompression of unnecessary data and provides top-class slicing speed.
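The benefit of a double partition for slicing can be illustrated with a small counting model. The sketch below is not Blosc2 code: it assumes an array split into chunks that are in turn split into blocks, that block shapes divide chunk shapes evenly, and that a whole partition must be decompressed whenever a slice touches any part of it. The `touched` and `elements_decompressed` helpers and the example shapes are hypothetical.

```python
from itertools import product
from math import prod


def touched(region, part_shape):
    """Return the indices of all N-dimensional partitions that intersect `region`.

    `region` is a list of (start, stop) pairs, one per dimension, and
    `part_shape` is the partition shape along each dimension.
    """
    axes = [range(lo // size, (hi - 1) // size + 1)
            for (lo, hi), size in zip(region, part_shape)]
    return list(product(*axes))


def elements_decompressed(region, part_shape):
    """Count elements decompressed if whole partitions are the unit of work."""
    return len(touched(region, part_shape)) * prod(part_shape)


# Hypothetical layout: a 1000 x 1000 array with 200 x 200 chunks,
# each chunk split into 50 x 50 blocks.
chunks = (200, 200)
blocks = (50, 50)

# Slice of interest: rows 120:130, all 1000 columns.
region = [(120, 130), (0, 1000)]

requested = prod(hi - lo for lo, hi in region)
chunk_only = elements_decompressed(region, chunks)
with_blocks = elements_decompressed(region, blocks)

print("requested elements:     ", requested)
print("single partition (chunk):", chunk_only)
print("double partition (block):", with_blocks)
```

In this model the slice requests 10,000 elements. Decompressing whole chunks touches 200,000 elements, while decompressing only the intersecting blocks touches 50,000, so the second partition level cuts the unnecessary work by a factor of four for this particular slice.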

Data Track