PyData Global 2023

Architecting Data Tools: A Roadmap for Turning Theory and Data Projects into Python Packages
12-08, 13:00–15:00 (UTC), General Track

The goal of this workshop is to address the gap between the development of technical work -- whether via research or more traditional data science work -- and its reproducibility, by giving attendees the knowledge they need to get started creating Python packages. Whether you are a researcher (with basic Python knowledge) wanting to make your theories more accessible via code, or a data professional wanting to share your Python code inside or outside of your organization, this workshop will help you understand how to contribute to, and develop, open-source projects from scratch.


Reproducibility in traditional research and industry can be tricky, but those who succeed in the former group can go on to kick-start a company around their work (e.g., snorkel.ai, cleanlab.ai, Databricks, explosion.ai, and the list goes on), and those in the latter group go on to build tools that not only form a community around them but also enable the creation of new companies -- e.g., Lyft released Flyte and Union.ai was formed around it, and Netflix released Metaflow and Outerbounds was formed around it. With that in mind, this workshop seeks to give participants an entry point into either group by helping them make their research and data-related work accessible to themselves and others via code.

We will start the workshop with a short 10-minute presentation while everyone gets set up; in it, we will go over what reproducibility means for research and software, and then describe how packaging works in Python. The rest of the workshop is split into three parts, each about 25 minutes long, covering different topics and tools. In the first part, we will learn how to create a Python package for a theoretical research paper; in the second, we will refactor a data science project into a Python package to be shared with a data team. In the last part, we will review different strategies for testing, releasing, and maintaining our work before and after we open-source it.
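For a sense of what "how packaging works in Python" looks like in practice, modern packaging with setuptools revolves around a single configuration file. Below is a minimal, hypothetical `pyproject.toml` sketch -- the project name, version, and dependencies are illustrative, not the exact files used in the workshop:

```toml
# Hypothetical minimal pyproject.toml for a setuptools-based package.
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my-research-package"          # illustrative name
version = "0.1.0"
description = "A reproducible implementation of a paper's method."
requires-python = ">=3.9"
dependencies = ["numpy", "pandas"]    # illustrative dependencies
```

With a file like this in place, `pip install .` builds and installs the package locally, which is one of the simplest ways to make code shareable and reproducible.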

Some of the tools we will use in this workshop include cookiecutter, mkdocs, setuptools, nbdev, numpy, scipy, scikit-learn, pandas, pytest, and ibis, plus a few more. Some of the programming paradigms and topics we will touch on include object-oriented and functional programming, decorators, version control, machine learning, experimentation, and a few more.
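As a small taste of one of the paradigms listed above, here is a hedged sketch of a decorator of the kind one might ship in a package and exercise with pytest. The names (`timed`, `add`) are illustrative, not taken from the workshop materials:

```python
import functools
import time

def timed(func):
    """Decorator that records how long the wrapped function took to run."""
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    wrapper.last_elapsed = None  # no call has been timed yet
    return wrapper

@timed
def add(a, b):
    return a + b

print(add(2, 3))  # 5
```

Because `functools.wraps` copies the original function's metadata onto the wrapper, tools like pytest and documentation generators still see `add` under its own name, which matters once the decorator lives inside a published package.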

By the end of this workshop, attendees will have a better understanding of, and the practical skills needed to, create their own open-source projects and/or contribute to their favorite ones. So, if you want to learn how to write code in a reproducible way before you publish your research, this workshop is for you. If you are a data professional copying and pasting code from old projects into new ones, this workshop is for you. Lastly, if you want to gain software engineering skills to become more productive in your day-to-day work, this workshop is for you.


Prior Knowledge Expected

No previous knowledge expected

Ramon is currently a developer advocate at Seldon. Before joining Seldon, he worked as an independent freelance data professional and as a Senior Product Developer at Decoded, where he created custom data science tools, workshops, and training programs for clients in various industries. Going a bit further back, Ramon used to wear different research hats in the areas of entrepreneurship, strategy, consumer behavior, and development economics in industry and academia. Outside of work, he enjoys giving talks and technical workshops and has participated in several conferences and meetup events. In his free time, you will most likely find him traveling to new places, mountain biking, or both.