PyData Global 2023

Optimize first, parallelize second: a better path to faster data processing
12-07, 18:00–18:30 (UTC), Data Track

You’re processing a large amount of data with Python, and your code is too slow.
One obvious way to get faster results is to add multithreading or multiprocessing, so you can use multiple CPU cores.
Unfortunately, switching straight to parallelism is almost always premature, often unnecessary, and sometimes impossible.
We'll cover the different goals for performance, why parallelism only achieves one of them, the costs of parallelism, and the alternative: speeding up your code first.
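As a minimal sketch of that "obvious" first step (a hypothetical example, not code from the talk), a multiprocessing.Pool spreads per-record work across CPU cores, but every worker still runs the same slow per-record code, and you pay for process startup and data serialization:

    # Hypothetical illustration of reaching straight for parallelism.
    from multiprocessing import Pool

    def process(record):
        # Stand-in for slow, pure-Python per-record work.
        return sum(i * i for i in range(record))

    if __name__ == "__main__":
        records = list(range(10_000))
        # One worker process per CPU core by default; inputs and outputs
        # are pickled and sent between processes.
        with Pool() as pool:
            results = pool.map(process, records)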


In this talk we'll:

  • Consider two different goals for performance: faster results and reduced hardware costs. Parallelism only gives you the former.
  • Consider other limitations of parallelism.
  • Go over some of the ways you can speed up your code before you consider parallelism, from better algorithms to the many different ways you can optimize your code.
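To make that last bullet concrete, here is a minimal, hypothetical sketch (not an example from the talk) of the kind of algorithmic fix that can beat parallelism outright: replacing repeated list scans with a set lookup.

    # Hypothetical illustration of an algorithmic improvement.
    needles = list(range(1_000))
    haystack = list(range(500, 50_000))

    # Slow: every membership test scans the whole list,
    # O(len(needles) * len(haystack)) overall.
    matches_slow = [x for x in needles if x in haystack]

    # Fast: build a set once, then each lookup is O(1) on average.
    haystack_set = set(haystack)
    matches_fast = [x for x in needles if x in haystack_set]

    assert matches_slow == matches_fast

Splitting the slow version across four cores would at best cut its runtime by 4×; the set-based version is orders of magnitude faster on a single core.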

Prior Knowledge Expected

Previous knowledge expected

Itamar is the creator of Sciagraph, a performance and memory profiler for Python data science processing. He is working on a book, aimed at data scientists and scientists who use Python, about speeding up low-level code. He writes about Python performance, Docker packaging, and more at https://pythonspeed.com.