12-06, 11:30–13:00 (UTC), Data Track
Data compression is not a one-codec-fits-all problem: it inherently involves a trade-off between compression ratio and speed, since a higher compression ratio usually means a slower compression process. Depending on your needs, you may want to prioritize one over the other. The problem is that finding the optimal compression parameters can be slow: the number of parameter combinations (codec, compression level, filter, split mode, number of threads, etc.) is large, and narrowing them down can require a significant amount of manual trial and error.
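This trade-off is easy to demonstrate. The sketch below uses Python's standard-library zlib (not Blosc2) on a synthetic, highly compressible payload; the exact numbers will vary by machine and data, but a faster compression level typically gives up some ratio:

```python
import time
import zlib

# Synthetic, highly compressible sample data (a stand-in for a real dataset)
data = b"some repetitive payload " * 50_000

# Compare a fast setting against a high-ratio setting of the same codec
for level in (1, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    print(f"level={level}: ratio={ratio:.1f}x, time={elapsed * 1000:.1f} ms")
```

Blosc2 exposes a much larger parameter space than this single codec, which is precisely why automating the search pays off.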
Btune (https://btune.blosc.org) is a dynamic plugin for Blosc2 that can help find the optimal combination of compression parameters for datasets compressed with Blosc2 (https://github.com/Blosc/c-blosc2, https://github.com/Blosc/python-blosc2), significantly speeding up this process.
When you have to compress lots of data, optimizing the compression parameters can be a daunting task. For instance, if you are storing data from high-speed data acquisition systems, you may want to prioritize compression speed over compression ratio, because you will be writing data at speeds near the capacity of your systems. On the other hand, if the goal is to access the data repeatedly from a file system, you may want to prioritize decompression speed over compression ratio for optimal performance.
Btune (https://btune.blosc.org) is a tool for the Blosc2 compressor that helps find the optimal combination of compression parameters. Depending on your needs, Btune offers different tiers of support for tuning datasets. In this tutorial we are going to exercise the free tier (Btune free). In this mode, a genetic algorithm tests different combinations of compression parameters against the user's requirements for both compression ratio and speed on each chunk of the dataset. It assigns a score to each combination and, after a number of iterations, stops and uses the combination with the best (lowest) score for the rest of the dataset.
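The scoring idea can be sketched with a brute-force search over a tiny parameter space. Note the simplifications: standard-library codecs stand in for Blosc2, an exhaustive loop stands in for Btune's genetic algorithm, and the `TRADEOFF` knob and scoring formula are illustrative assumptions, not Btune's actual score:

```python
import bz2
import lzma
import time
import zlib

data = b"example chunk of data " * 20_000  # synthetic stand-in for one chunk

# A tiny candidate space; Blosc2's real space also covers filters,
# split modes, thread counts, etc.
candidates = {
    "zlib-1": lambda d: zlib.compress(d, 1),
    "zlib-9": lambda d: zlib.compress(d, 9),
    "bz2-9": lambda d: bz2.compress(d, 9),
    "lzma-6": lambda d: lzma.compress(d, preset=6),
}

TRADEOFF = 0.5  # hypothetical knob: 0.0 favors speed only, 1.0 favors ratio only


def score(compress):
    # Lower is better: blend the inverse compression ratio with elapsed time
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    return TRADEOFF * (1.0 / ratio) + (1.0 - TRADEOFF) * elapsed


best = min(candidates, key=lambda name: score(candidates[name]))
print("best combination:", best)
```

Trying every combination like this is exactly what becomes infeasible as the parameter space grows, which is why Btune searches it with a genetic algorithm instead.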
The tutorial will be hands-on, following a gentle introduction to Btune. It is addressed to data scientists who need to cope with large datasets, whether tabular, N-dimensional and/or sparse. Some knowledge of data handling tools (h5py, NetCDF, PyTables or Zarr) would be helpful, but is not required.
Main takeaways for attendees:
- A description of how Btune free works
- How to find optimal compression/decompression parameters for a specific dataset
- Useful advice on the best codecs and filters for different kinds of datasets
The materials for the tutorial will be available via a GitHub repository, based on a stripped-down version of this one (which includes a more complete demonstration of the three Btune tiers of support): https://github.com/Blosc/Btune-tutorial
Finally, users wanting to explore the best compression codecs/filters for their own use cases are encouraged to bring their own datasets and apply the techniques learned during the tutorial.
The requirements for following the tutorial will be:
- Laptop (or a remote machine)
- Operating systems supported
- Linux
- MacOS
- Windows: only via WSL. Please install it prior to the tutorial; instructions here: https://learn.microsoft.com/en-us/windows/wsl/install
- Pyenv/Conda/Mamba environment with Python 3.10, 3.11 or 3.12 installed
- Access to a Unix shell (preferably bash)
No previous knowledge expected
I am a curious person who studied Physics and Math when I was young. Through the years, I developed a passion for handling large datasets and using compression to enable their analysis using regular hardware that is accessible to everyone.
I am leading the Blosc Development Team, and currently interested in determining, ahead of time, which combinations of codecs and filters can provide a personalized compression experience. This way, users can choose whether they prefer a higher compression ratio, faster compression speed, or a balance between both.
Last, but not least, I was recently awarded the "2023 Project Sustainability Award" by NumFOCUS.
You can learn more about what I am working on by reading my latest blog posts.