PyData Global 2023

Extremes, outliers, and GOATS: on life in a lognormal world
12-06, 16:00–16:30 (UTC), General Track

The fastest runners are much faster than we expect from a Gaussian distribution, and the best chess players are much better. In almost every field of human endeavor, there are outliers who stand out even among the most talented people in the world. Where do they come from?

In this talk, I present two data-generating processes that yield lognormal distributions as possible explanations, and show that these models describe many real-world scenarios in natural and social sciences, engineering, and business. I also suggest methods -- using SciPy tools -- for identifying these distributions, estimating their parameters, and generating predictions.


One of the most frequently asked questions in statistics forums is how to deal with outliers. The answer depends on where they came from. If we think they are the result of measurement error, they should probably be discarded. But if they are a natural outcome of the system that produced the data, they should be retained -- and they might be critical to modeling and predicting the behavior of the system.

To decide what to do with outliers, we need domain knowledge, but we also need modeling tools. In this talk I suggest two models that yield extreme values and outliers: multiplicative growth and "weakest link" limiting factors. I show that these models are a good fit for data from a wide range of natural and engineered systems.
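As a rough illustration of the first model, the sketch below simulates multiplicative growth: each simulated individual starts at 1 and is multiplied by a random factor in every period. The population size, number of periods, and factor distribution are all hypothetical choices made for this example, not values from the talk. If the product of many such factors is approximately lognormal, the raw values should be right-skewed while their logarithms should be roughly symmetric.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical settings, chosen only for illustration.
n_people = 10_000   # simulated individuals
n_steps = 50        # growth periods per individual

# Each period multiplies the current value by a random factor near 1.
factors = rng.normal(loc=1.02, scale=0.05, size=(n_people, n_steps))
final = factors.prod(axis=1)

# The log of a product is a sum of logs, so log(final) should be
# close to Gaussian, while final itself has a long right tail.
skew_raw = stats.skew(final)
skew_log = stats.skew(np.log(final))

print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```

The design choice here is to compare skewness before and after the log transform; a probability plot of the log values would be a more thorough check.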

These data-generating processes might explain why elite athletes are so much better than average, and why even among elites, there is often an uncontested GOAT (greatest of all time). Closer to home, these processes inform business and life decisions related to tradeoffs between exploration and exploitation.

Outline:
* Running and chess: why are the elites so elite?
* Birth weight is Gaussian, but adult weight is lognormal: a model of multiplicative growth
* The limiting factors of running speed: a model of the "weakest link"
* Computational tools: identifying and fitting lognormal models
* Outliers and 10,000 hours: multiplicative growth, weakest link, or both?
* The GOAT of all GOATs: Marion Tinsley
* What should you do? Exploration and exploitation in a lognormal world
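The "computational tools" item in the outline can be sketched with SciPy as follows. This is a minimal example under stated assumptions, not necessarily the workflow shown in the talk: the data are synthetic (drawn from a lognormal with made-up parameters), the location parameter is fixed at 0 to get the two-parameter lognormal, and the 99.9th percentile is an arbitrary threshold for comparing tail predictions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Synthetic data with made-up parameters (assumption for illustration).
data = rng.lognormal(mean=3.0, sigma=0.4, size=5_000)

# Identify: if data are lognormal, log(data) should look Gaussian.
# probplot returns the correlation r of the normal probability plot.
(_, _), (slope, intercept, r) = stats.probplot(np.log(data), dist="norm")

# Estimate: fit a lognormal with the location fixed at 0.
shape, loc, scale = stats.lognorm.fit(data, floc=0)
mu_hat, sigma_hat = np.log(scale), shape

# Predict: chance of exceeding a high threshold under the fitted
# lognormal versus a Gaussian fitted to the same raw data.
threshold = np.quantile(data, 0.999)
p_lognormal = stats.lognorm.sf(threshold, shape, loc=loc, scale=scale)
p_gaussian = stats.norm.sf(threshold, loc=data.mean(), scale=data.std())

print(f"probability plot r = {r:.4f}")
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"P(X > threshold): lognormal {p_lognormal:.2e}, Gaussian {p_gaussian:.2e}")
```

With data like these, the Gaussian fit assigns a far smaller probability to extreme values than the lognormal fit does, which is one way to see why outliers look surprising under a Gaussian model.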


Prior Knowledge Expected

No previous knowledge expected

Allen Downey is a curriculum designer at Brilliant.org and professor emeritus at Olin College.
He is the author of several books -- including Think Python, Think Bayes, and Probably Overthinking It -- and a blog about data science and Bayesian statistics. He received a Ph.D. in computer science from the University of California, Berkeley, and Bachelor's and Master's degrees from MIT.