PyData Global 2023

Data persistence with consistency and performance in a truly serverless system
12-08, 17:00–17:30 (UTC), Data Track

Fully serverless systems are compelling for a number of reasons: they are inherently scalable, highly available, and carry a low maintenance burden. The challenge with a serverless system is providing sufficiently strong guarantees of data consistency without either sacrificing performance or simply shifting the burden of maintaining consistency onto an external client-server system. At ArcticDB (https://github.com/man-group/arcticdb) we have spent years refining a fully serverless model that pushes the boundaries of what can be achieved with nothing but a Python library and commodity object storage. In this talk we will share reusable techniques for ensuring data reliability without external synchronization.


Data scientists and quants who need to store valuable data long-term, particularly on cloud object storage, have traditionally had two choices: use a file format, or use a client-server database system. The former is sufficient provided the number of objects is small and the data is updated infrequently; however, tracking thousands of revisions across millions of dataframes soon becomes an onerous task, and simultaneous modifications are a potential source of data corruption. Client-server solutions are great at tracking many objects and providing strong consistency guarantees, but can be expensive to run and constitute a single point of failure, leading to the unavailability or even loss of critical data.

On modern object stores reading and writing are fast, but discovery operations like listing are potentially slow and may provide only eventual consistency. This has led to the development of hybrid embedded/client-server solutions that store user data in the cloud with some or all of the metadata in a traditional database, providing good data consistency guarantees but sacrificing many of the benefits of a purely embedded solution.

At ArcticDB we have been pushing the boundaries of a truly embedded data model as far as we can, and in this talk we will describe the methods we have created that allow us to scale data processing out to thousands of workers, whilst ensuring that readers never see broken or partially updated information. You will learn the techniques that we have developed to provide things like point-in-time snapshots, time-travel and batch operations over a whole universe of dataframes, with the reliability required to trade billions of dollars, using nothing but Python and a key-value store.
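To give a flavour of the approach, the core trick behind consistent reads over a plain key-value store is to write immutable data keys first and only then publish them via a single-key pointer write, which object stores apply atomically. The following is a minimal illustrative sketch using a Python dict as a stand-in for the object store; the key names and layout are assumptions for illustration, not ArcticDB's actual key schema:

```python
# Copy-on-write versioning over a key-value store, sketched with a dict.
# Readers never follow unpublished keys, so a writer that crashes mid-write
# leaves at most orphaned data keys, never a half-visible version.

import json

store = {}  # key -> bytes; models a commodity object store


def write_version(symbol, data, version):
    """Write the immutable data key first, then publish the version."""
    data_key = f"data/{symbol}/v{version}"
    store[data_key] = json.dumps(data).encode()
    # Publishing is a single-key write; object stores apply it atomically.
    store[f"version/{symbol}/v{version}"] = data_key.encode()
    store[f"ref/{symbol}"] = str(version).encode()  # "latest" pointer


def read(symbol, as_of=None):
    """Read the latest version, or time-travel to an earlier one."""
    version = int(store[f"ref/{symbol}"]) if as_of is None else as_of
    data_key = store[f"version/{symbol}/v{version}"].decode()
    return json.loads(store[data_key])


write_version("prices", {"close": 101.5}, version=1)
write_version("prices", {"close": 102.0}, version=2)
assert read("prices") == {"close": 102.0}           # latest
assert read("prices", as_of=1) == {"close": 101.5}  # time-travel
```

Because readers fetch deterministic keys rather than listing the store, they avoid both the latency and the eventual-consistency pitfalls of list operations; snapshots fall out naturally by pinning a set of version keys.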


Prior Knowledge Expected

No previous knowledge expected

William Dealtry has been working in both Python and C++ for many years, and has been a member of the C++ standardization committee for more than a decade. Having previously worked with financial data at places like the New York Stock Exchange and Goldman Sachs, he is currently the Architect of a new open-source DataFrame database, ArcticDB, which is backed by long-time Python enthusiasts Man Group and Bloomberg.