12-08, 12:30–13:00 (UTC), LLM Track
Open source large language models (LLMs) are now inching towards matching the proficiency of proprietary models such as GPT-4. In addition, operating your own LLMs can offer advantages in data privacy, model customizability, and cost efficiency. However, running your own LLMs and realizing these benefits in a production environment is not easy: it requires a precise set of optimizations and a robust infrastructure. Come to this talk to learn about the problems you might face when running your own large language models, and find out how OpenLLM can help you solve them.
With the release of Llama 2, open source models are closing the performance gap with GPT-4. Remarkably, for tasks that require only a low level of reasoning, fine-tuned LLMs with smaller parameter counts can significantly outperform their proprietary counterparts. In addition, running your own LLMs keeps your data private: sensitive data never leaves your infrastructure and is never used for training. Lastly, since you pay for compute resources rather than per token, you can achieve better economies of scale by using smaller models and scaling the underlying resources efficiently.
However, running your own open source LLMs comes with many challenges. We will take a deep dive into five categories:
- Operability: Can I deploy and serve the model reliably on hardware I have available?
- Scalability: Can I scale serving instances elastically while maintaining high availability?
- Throughput: Can I serve a large number of concurrent text generation requests efficiently?
- Latency: Can I respond with a reasonable delay to both APIs and human users?
- Cost: How much do I need to spend to run my open source LLMs?
With these challenges in mind, we want to help AI developers deploy and operate their own LLMs more efficiently. We started the open source project OpenLLM, which aims to address these challenges by giving users options to optimize LLM serving through techniques such as quantization, continuous batching, model parallelism, kernel optimization, and token streaming. Our goal is to package state-of-the-art LLM serving optimizations in a single, easy-to-use Python package. In a domain as fast-evolving and densely researched as LLM inference, OpenLLM lets AI developers focus on their core tasks without having to continuously track and integrate the latest optimizations.
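To make two of these techniques concrete, the sketch below shows what 4-bit quantization and token streaming look like when wired up by hand with the Hugging Face transformers and bitsandbytes libraries. It illustrates the underlying mechanisms rather than OpenLLM's own API; the model name, generation parameters, and hardware assumptions (a CUDA GPU with enough memory for a 4-bit 7B model) are examples only.

```python
# Illustrative only: manual 4-bit quantization + token streaming with
# Hugging Face transformers/bitsandbytes (not OpenLLM's API).
# Assumes a CUDA GPU and the packages: transformers, accelerate, bitsandbytes.
from threading import Thread

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TextIteratorStreamer,
)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model (gated on the Hugging Face Hub)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # Quantization: loading weights in 4-bit shrinks memory use so the model
    # fits on smaller GPUs, at a small cost in accuracy.
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)

prompt = "Explain why token streaming improves perceived latency."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Token streaming: generation runs in a background thread while decoded text
# is yielded to the caller as soon as each token is produced.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128),
).start()

for text_chunk in streamer:
    print(text_chunk, end="", flush=True)
```

The point of OpenLLM, as described above, is to package these and further optimizations (continuous batching, model parallelism, kernel optimization) behind a single serving interface, so developers do not have to assemble them by hand for every model and deployment target.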
No previous knowledge expected
Sean currently serves as the Head of Engineering at BentoML. He has led the team to successfully release multiple open-source projects, including BentoML and OpenLLM, aimed at facilitating AI application development. Additionally, Sean led the launch of BentoCloud, an AI deployment platform designed for deploying and scaling AI applications in production. Prior to his role at BentoML, he led engineering teams at LinkedIn, where he supported the service infrastructure powering all of LinkedIn's backend services.