PyData Global 2023

Blosc2: Fast And Flexible Handling Of N-Dimensional and Sparse Datasets
12-06, 13:30–14:00 (UTC), Data Track

N-dimensional datasets are pervasive in many scientific areas, and getting quick slices of them is critical for an improved exploration experience. Blosc2 is a compression and format library that recently gained support for handling such multidimensional datasets. Crucially, by leveraging compression, Blosc2 can deal with sparse datasets effectively: the zeroed parts are almost entirely suppressed, while the non-zero parts are still stored in less space than their uncompressed counterparts. In addition, the new double data partition inside Blosc2 minimizes the decompression of unnecessary data and provides top-class slicing speed.
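
As a quick illustration of the sparse-data claim, here is a minimal sketch in Python, assuming the python-blosc2 package and its NDArray API (the size attributes on .schunk are as exposed by python-blosc2 2.x); the exact compression ratio will depend on the data:

    import numpy as np
    import blosc2

    # Build an array that is overwhelmingly zeros (i.e., sparse).
    a = np.zeros((1000, 1000), dtype=np.float64)
    a[::100, ::100] = 1.0  # a few scattered non-zero values

    # Compress it into a Blosc2 NDArray; the long runs of zeros are
    # almost entirely suppressed by the compressor.
    nda = blosc2.asarray(a)
    print(f"compression ratio: {nda.schunk.nbytes / nda.schunk.cbytes:.1f}x")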


We will describe the new improvements in Blosc, a high-performance compressor optimized for binary data (floating-point numbers, integers, and booleans), although it can handle string data too. Blosc is widely used in popular storage libraries like HDF5 (via h5py or PyTables) and Zarr, probably producing many petabytes of compressed data every day around the world.

C-Blosc2 (https://github.com/Blosc/c-blosc2) is the new major version of C-Blosc, and it comes with Python-Blosc2 (https://github.com/Blosc/python-blosc2), a shallow Python wrapper that exposes many of its new features. Among the most interesting ones:

  • 64-bit containers: no practical limit on dataset sizes.
  • Frames: serialize data either on disk or in memory.
  • Meta-layers: add metadata in different layers inside frames.
  • Blosc2 NDim: create, read, and slice n-dimensional datasets efficiently.
  • Double partitioning: split data into fine-grained cubes for faster reads of n-dim slices (see the sketch after this list).
  • Parallel reads: when several blocks of a chunk have to be read, this is done in parallel.
  • Support for special values: large sequences of repeated values can be represented efficiently.
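
To make the NDim and double-partitioning items concrete, here is a minimal sketch assuming python-blosc2's NDArray API; the partition sizes are purely illustrative, not tuned recommendations:

    import numpy as np
    import blosc2

    shape = (1000, 1000, 100)
    # `chunks` is the coarse (first) partition; `blocks` is the
    # fine-grained (second) partition inside each chunk.
    nda = blosc2.zeros(shape, dtype=np.float64,
                       chunks=(100, 100, 100),
                       blocks=(10, 10, 10))

    # Reading a thin slice only decompresses the blocks that
    # intersect it, which is what makes slicing fast.
    sl = nda[500:510, 500:510, :]  # returns a NumPy array
    print(sl.shape)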

By leveraging these features, Blosc2 provides a powerful yet flexible tool for data handling. For example, when Blosc2 cooperates with libraries like PyTables/HDF5, it becomes possible to query tables with 100 trillion rows in human time frames.
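
As a hedged sketch of that cooperation (assuming PyTables >= 3.8, which added the "blosc2" complib; the table layout and query below are made up for illustration):

    import tables

    class Particle(tables.IsDescription):
        id = tables.Int64Col()
        value = tables.Float64Col()

    # Ask HDF5 to compress its chunks with Blosc2 + Zstd.
    filters = tables.Filters(complevel=5, complib="blosc2:zstd")
    with tables.open_file("data.h5", mode="w") as f:
        table = f.create_table("/", "particles", Particle, filters=filters)
        table.append([(i, i * 0.5) for i in range(1_000_000)])
        table.flush()
        # In-kernel query: chunks are decompressed only as needed.
        hits = [row["id"] for row in table.where("value > 499990.0")]
        print(len(hits))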

The talk will be a gentle introduction to Blosc2, addressed to data scientists who need to cope with large datasets that are N-dimensional and/or sparse. No prior knowledge is required other than Python itself.

Main takeaways for attendees:

  • A description of the main features of Blosc2
  • Useful advice on the best codecs and filters for different kinds of datasets
  • How to partition multidimensional datasets so they can be sliced efficiently
  • A comparison with other packages (h5py, PyTables, Zarr) in terms of efficiency and resource savings

Finally, we will show an example of effectively exploring a 3-dimensional dataset of the Milky Way (using data from the Gaia mission).


Prior Knowledge Expected

No previous knowledge expected

I am a curious person who studied Physics and Math when I was young. Through the years, I developed a passion for handling large datasets and using compression to enable their analysis using regular hardware that is accessible to everyone.

I am leading the Blosc Development Team, and I am currently interested in determining, ahead of time, which combinations of codecs and filters can provide a personalized compression experience. This way, users can choose whether they prefer a higher compression ratio, faster compression speed, or a balance between the two.

Last but not least, I have recently been awarded the "2023 Project Sustainability Award" by NumFOCUS.

You can learn more about what I am working on by reading my latest blog posts.
