PyData Global 2023

LanceDB: lightweight billion-scale vector search for multimodal AI
12-06, 19:00–19:30 (UTC), Data Track

With LanceDB you can make your laptop more powerful than any distributed vector database for semantic search. LanceDB is an open-source embedded vector database. It's lightweight like SQLite but powerful enough to deliver real-time semantic search over a billion vectors on a laptop.
LanceDB is backed by the Lance columnar format, which delivers up to 100x performance improvement over Parquet for managing multimodal AI data (e.g., vectors, images, and point clouds). Lance gives AI teams a high-performance single source of truth across the whole AI life cycle, from analytics to training to debugging.

In this talk we'll cover the use cases in production inference and in the data lake. We'll talk about the technical details of the Lance columnar format and what makes it different. And we'll show a demonstration of LanceDB for multimodal semantic search.


The Lance format differs from Parquet in several important respects:
1. A data layout that makes Lance fast for both scans and random access
2. A different I/O plan and configuration, optimized for large binary blob data
3. Built-in indexing for fast vector search
4. Zero-copy schema evolution for full reproducibility and instant rollback
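
To make point 3 concrete, here is a minimal sketch of the exhaustive (brute-force) search that a vector index is meant to avoid. It is not LanceDB's or Lance's implementation. It uses only the Python standard library, and the function names, the toy 3-dimensional "embeddings", and the item labels are all hypothetical, chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query, vectors, k=2):
    """Return the keys of the k stored vectors most similar to the query.

    Every stored vector is compared with the query, so the cost grows
    linearly with the number of vectors.
    """
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
toy_vectors = {
    "cat photo": [0.9, 0.1, 0.0],
    "dog photo": [0.8, 0.3, 0.1],
    "invoice pdf": [0.0, 0.2, 0.9],
}

results = brute_force_search([1.0, 0.0, 0.0], toy_vectors, k=2)
print(results)
```

Because this scan touches every vector on every query, it does not scale to a billion vectors; an index trades a small amount of accuracy for examining only a fraction of the data.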

This talk is designed for an intermediate audience. We assume you know Python basics, and the technical details will be more meaningful if you have basic knowledge of data systems (e.g., what a columnar format is, and why Parquet and Arrow are used).


Prior Knowledge Expected

Previous knowledge expected

Chang is the CEO / Co-founder of LanceDB and has been building data science / machine learning tooling for almost two decades. Previously he was VP of Eng at TubiTV, where he focused on recommender systems, MLOps, and experimentation. A long, long time ago, in a galaxy far, far away, he was one of the original co-authors of pandas.