PyData Global 2023

Improving Open Data Quality using Python
12-06, 16:00–17:30 (UTC), Machine Learning Track

In this session we will demonstrate how to measure and improve the quality of open data using the open source Python library Great Expectations. Attendees will learn quality testing techniques and methodologies to prepare high-quality longitudinal datasets using open data from city and regional portals.


Open data quality is crucial for effective analysis and decision making. However, real-world open data often contains dirty data, formatting inconsistencies, and anomalies. This tutorial will demonstrate data quality testing using the Great Expectations Python library and some common remediation techniques using pandas.

In the context of multi-year datasets published by public bodies and administrations, we will explore how to produce longitudinal datasets that improve consistency over individual yearly files.
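In practice, producing a longitudinal dataset usually means tagging each yearly file with its year and stacking the files into a single frame. A minimal pandas sketch, with made-up portal data and illustrative column names:

```python
import pandas as pd

# Hypothetical yearly extracts from a city open data portal;
# the column names are illustrative, not from a real dataset.
yearly = {
    2021: pd.DataFrame({"district": ["Centro", "Retiro"], "visits": [120, 80]}),
    2022: pd.DataFrame({"district": ["Centro", "Retiro"], "visits": [150, 95]}),
}

# Tag each yearly frame with its year, then stack them into one
# longitudinal dataset.
frames = [df.assign(year=year) for year, df in yearly.items()]
longitudinal = pd.concat(frames, ignore_index=True)
```

This only works cleanly when the yearly files share a schema; the tutorial's later sections deal with the cases where they do not.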

The tutorial will cover:

1) Introduction to data quality concepts (10 minutes)

2) Validation of single datasets (40 minutes)
- Basic Exploratory Data Analysis
- Profile data & identify anomalies with Great Expectations
- Fixing problems detected using pandas
- Assessing data quality improvements after remediation
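The detect-fix-reassess loop above can be sketched with plain pandas checks on a small invented dataset (Great Expectations offers declarative equivalents such as expect_column_values_to_be_between):

```python
import pandas as pd

# Illustrative messy data: inconsistent label casing, stray whitespace,
# and a physically impossible negative reading.
df = pd.DataFrame({
    "station": ["Sol ", "sol", "Atocha"],
    "pm10": [21.5, -3.0, 18.2],
})

# Detect anomalies: PM10 concentrations cannot be negative.
bad_range = df["pm10"] < 0

# Remediate with pandas: normalize labels, mask impossible readings.
df["station"] = df["station"].str.strip().str.title()
df.loc[bad_range, "pm10"] = float("nan")

# Re-assess: the range check now passes on the remaining values.
assert (df["pm10"].dropna() >= 0).all()
```

The station names and values here are fabricated for illustration; the tutorial applies the same pattern to real open data files.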

3) Preparing longitudinal data (30 minutes)
- Challenges of constructing a unified dataset from multiple individual files
- How to use Great Expectations to check for inconsistencies across files
- Alternative approaches for unfixable issues in data
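A common cross-file inconsistency is a publisher silently renaming a column between years. A minimal sketch of detecting and working around this, using invented frames (Great Expectations can express the schema check declaratively with expect_table_columns_to_match_set):

```python
import pandas as pd

# Two hypothetical yearly extracts where the publisher renamed a column.
df_2021 = pd.DataFrame({"district": ["Centro"], "visits": [120]})
df_2022 = pd.DataFrame({"district": ["Centro"], "n_visits": [150]})

# Cross-file schema check: every yearly file should expose the same
# columns as the reference year.
expected = set(df_2021.columns)
mismatches = expected.symmetric_difference(df_2022.columns)
# mismatches -> {'visits', 'n_visits'}, flagging the rename

# When the upstream files cannot be fixed, one workaround is mapping
# known aliases back to a canonical name before concatenation.
df_2022 = df_2022.rename(columns={"n_visits": "visits"})
```

The alias map approach keeps the remediation explicit and reviewable, which matters when the source files are republished.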

4) Q&A (10 minutes)

The target audience is data analysts, scientists, and engineers, as well as citizen scientists who work with open data. Attendees should have basic Python and pandas skills and know how to use Jupyter notebooks to execute and customize the examples provided. The tutorial materials and examples will be hosted in a public GitHub repo for attendees to follow along.

Attendees will learn quality assurance techniques to profile, validate, and improve open data. The tutorial will equip them with reusable, standards-aligned workflows to prepare high-quality, analysis-ready open data.


Prior Knowledge Expected


César García Sáez. Computer Systems Engineer with an MSc in Data Science from the Universitat Oberta de Catalunya (UOC). His Master's Thesis focused on how to improve the data quality of open data datasets.

Speaker and researcher specialized in digital fabrication, the Internet of Things, and open data. He worked in IT for 14+ years as a System Administrator in the Madrid City Council IT department before becoming an independent researcher. Passionate about the potential of open source technologies and open data to change how we relate to our cities in a more profound, meaningful way.

A graduate of the FabAcademy 2013 digital fabrication course and co-founder of Makespace Madrid, he has spent recent years documenting and sharing stories about the maker movement at La Hora Maker, a podcast/YouTube channel.