12-07, 16:00–18:00 (UTC), Data Track
While most scientists don't work at the scale of black hole imaging research teams that analyze petabytes of data every day, it is easy to end up in a situation where your laptop doesn't have quite enough power for the analysis you need.
In this hands-on tutorial, you will learn the fundamentals of analyzing massive datasets through real-world examples, run on powerful cloud machines provided by the presenter. We go from how the data is stored and read to how it is processed and visualized.
"Big data" refers to any data that is too large to handle comfortably with your current tools and infrastructure. As the leading language for data science, Python has many mature options that allow you to work with datasets that are orders of magnitude larger than what can fit into a typical laptop's memory.
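To give a feel for how a dataset larger than memory can still be analyzed, here is a minimal sketch of the chunk-and-combine pattern using only the Python standard library. It is an illustration under stated assumptions, not the tutorial's material and not Dask's API: the function name `chunked_mean`, the column names, and the tiny inline dataset are all hypothetical.

```python
import csv
import io
from itertools import islice


def chunked_mean(rows, column, chunk_size=1000):
    """Mean of `column`, holding at most `chunk_size` rows in memory at a time."""
    total = 0.0
    count = 0
    while True:
        chunk = list(islice(rows, chunk_size))  # read the next bounded batch
        if not chunk:
            break
        # Reduce each chunk to a small partial result (sum and count) ...
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    # ... then combine the partial results into the final answer.
    return total / count if count else float("nan")


# Hypothetical stand-in for a large file; a real dataset would live on disk
# or in cloud storage and be far too big to load at once.
data = io.StringIO("station,temp\na,1.0\nb,2.0\na,3.0\nb,6.0\n")
reader = csv.DictReader(data)
mean_temp = chunked_mean(reader, "temp", chunk_size=2)
print(mean_temp)
```

Only one chunk is in memory at any moment, and each chunk is reduced to a small partial result before the next one is read. Libraries built for larger-than-memory data apply the same general idea to partitions of a dataset, often processing several partitions in parallel.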
This tutorial will help you understand how large-scale analysis differs from local workflows, the unique challenges associated with scale, and some best practices to work productively with your data.
By the end, you will be able to answer:
- What makes some data formats more efficient at scale?
- Why, how, and when (and when not) should you leverage parallel and distributed computation (primarily with Dask) in your work?
- How can you manage cloud storage, resources, and costs effectively?
- How can interactive visualization make large and complex data more understandable (primarily with hvPlot)?
The tutorial focuses on the reasoning, intuition, and best practices around big data workflows, while covering the practical details of Python libraries like Dask and hvPlot that are great at handling large data. It includes plenty of exercises to help you build a foundational understanding.
We expect you to have some familiarity with Python programming in a data science context. If you know how to create and import Python functions and have some experience doing exploratory data analysis with pandas or NumPy, you will be able to follow along with the tutorial comfortably.
Previous knowledge expected