12-08, 16:00–16:30 (UTC), General Track
The Python packaging ecosystem has a massive and diverse user community with various needs. A subset of this user base, the data science and scientific computing communities (i.e., the PyData communities), has historically relied on the conda package and environment management tools for its workflows. conda has robust solutions for packaging and distributing libraries and managing dependencies in environments, but there are still unsolved challenges for reliably reproducing runtime environments. For instance, compute-intensive R&D activities require certain reproducibility guarantees to support collaborative development and to ensure the stability and integrity of production-level tools. Many teams lack proper documentation and dependable practices for installing and regenerating the same runtime conditions across their software pipelines and systems, leading to product instability and to release and production delays.
In this talk, we will:
* Share reproducibility best practices for Python-based data science workflows. For this, we will present real-world examples where reproducibility was not a core requirement or consideration of the project but was introduced as an afterthought.
* Demonstrate a greenfield solution to this problem: conda-store, an open source project that provides flexible yet reproducible environments with features like version control, role-based access control, and background enforcement of best practices, all behind a user-friendly interface.
You will learn about the variables that affect runtime conditions, such as your project's dependencies and the technical details of your operating system and hardware. We will also present a checklist of automated tasks that should be part of a reproducible workflow and survey the different packaging solutions in the PyData ecosystem, with a deeper focus on conda-store. We hope to share the perspective of a downstream user of the packaging ecosystem and bring attention to the conversations around runtime-environment reproducibility.
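As a rough illustration of what enumerating these variables can look like, the sketch below records the interpreter, operating system, hardware architecture, and installed packages using only the Python standard library. The function name `capture_runtime_snapshot` and the keys of the returned dictionary are hypothetical choices for this example; they are not part of conda or conda-store.

```python
import json
import platform
import sys
from importlib import metadata


def capture_runtime_snapshot():
    """Collect interpreter, OS, hardware, and installed-package details.

    Hypothetical helper for illustration; the key names are assumptions.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata has no name
            packages[name.lower()] = dist.version
    return {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "executable": sys.executable,
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),  # CPU architecture, e.g. x86_64 or arm64
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be stored next to the project.
    print(json.dumps(capture_runtime_snapshot(), indent=2))
```

A snapshot like this only covers what Python itself can see. System libraries, compilers, and drivers installed outside the environment would need to be recorded separately.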
The PyData ecosystem, which includes Python libraries for scientific computing and machine learning, is at the heart of the rapidly advancing data science landscape. The conda package and environment management ecosystem is the leading solution for the first couple of steps in a data science workflow: creating isolated environments with the necessary PyData packages and sharing them with your team. However, some friction points currently exist in this stage; for example, have you ever found yourself in any of the following situations?
* You just got a flashy new laptop with a faster and more efficient CPU that uses a different architecture than your previous machine. You are installing your usual stack, but some dependencies are unavailable. Are there any workarounds?
* You have been working hard on a Jupyter Notebook to answer an important scientific question. When the time comes to share the results with your colleagues, supervisors, or the broader community, the excitement is gone as they tell you: “It doesn’t work,” “It doesn’t even start,” “I get different results.” You sent pretty detailed instructions on how to set things up, so what’s the issue?!
* The IT team in your company has decided to change the default operating system for security reasons. That should be OK because your computational experiments are appropriately annotated with their dependencies, but when you re-create your virtual environment, some packages cannot be found.
These situations share a common thread, the ever-so-dreaded “It works on my machine,” which stems from uncontrolled variables in the software supply chain.
We’ll start the talk by studying all the moving pieces behind running a Python script or notebook. Then, we will discuss common reproducibility pitfalls, best practices, and recommended workflows for annotating dependencies and runtime conditions. We’ll also review how the ‘conda’ and ‘pip’ ecosystems can play nicely together.
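One simple way to make such pitfalls visible is to compare the packages one person annotated against what another person actually has installed. The sketch below is a minimal illustration under stated assumptions: `diff_packages` is a hypothetical helper, and the two package mappings are made-up example data rather than output from conda, pip, or conda-store.

```python
def diff_packages(expected, actual):
    """Compare two {name: version} mappings and report the drift.

    Hypothetical helper for illustration only.
    """
    # Packages that were annotated but are absent from the other environment.
    missing = sorted(set(expected) - set(actual))
    # Packages present in the other environment that were never annotated.
    unexpected = sorted(set(actual) - set(expected))
    # Packages present in both environments but at different versions.
    changed = {
        name: (expected[name], actual[name])
        for name in sorted(set(expected) & set(actual))
        if expected[name] != actual[name]
    }
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


# Made-up example data: the author's annotated environment and a colleague's.
author = {"numpy": "1.26.4", "pandas": "2.2.1", "scipy": "1.12.0"}
colleague = {"numpy": "2.0.0", "pandas": "2.2.1", "matplotlib": "3.8.3"}

report = diff_packages(author, colleague)
print(report)
```

With this example data, the report flags `scipy` as missing, `matplotlib` as unexpected, and `numpy` as changed. Differences of this kind are a plausible source of an “I get different results” report.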
Finally, we will share conda-store – a project built on top of the conda ecosystem and designed to enforce these reproducibility standards with little overhead. This part of the talk will showcase implementation details and demonstrate some of its features, including:
* Behind-the-scenes enforcement of dependency and environment management best practices.
* Generation of reproducible, shareable artifacts like lockfiles, Docker images, and tarballs.
* An intuitive graphical interface that lets non-developers create, manage, and share environments, along with version control of environments for reliable builds.
We aim to share and showcase the different options in the packaging ecosystem for ensuring runtime reproducibility and extend the packaging-related conversations at the conference.
Previous knowledge expected