12-07, 18:30–19:00 (UTC), Machine Learning Track
When training models on large datasets, one of the biggest challenges is low GPU utilization. GPUs are often left idle by inefficient I/O and slow data loading: the compute far outpaces the storage feeding it. This mismatch between computation and storage wastes GPU resources, slows training, and drives up cloud costs. The rise of generative AI and the scarcity of GPUs are only making the problem worse.
In this session, Lu Qiu will discuss strategies for maximizing GPU utilization using the open-source PyTorch + Alluxio + S3 stack, covering:
- The challenges of I/O stalls that lead to low GPU utilization during model training
- The reference architecture for running PyTorch jobs with Alluxio on EKS while reading data from S3, with benchmark results for training ResNet50 and BERT (see the data-loading sketch after this list)
- How to use TensorBoard to identify bottlenecks in GPU utilization (a profiling sketch also follows below)
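To make the architecture concrete, here is a minimal sketch of the data-loading side, not the speaker's reference architecture: it assumes Alluxio's FUSE interface mounts the S3 bucket at /mnt/alluxio (a hypothetical path), so a standard PyTorch Dataset reads through a local POSIX path and repeated epochs hit Alluxio's cache instead of S3.

```python
# Sketch only: assumes an Alluxio FUSE mount at /mnt/alluxio backed by S3,
# and a flat directory of images under it (path and layout are hypothetical).
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class AlluxioImageDataset(Dataset):
    """Reads images through the Alluxio FUSE mount; hot data is served
    from Alluxio's cache rather than S3 on subsequent epochs."""
    def __init__(self, root="/mnt/alluxio/imagenet/train", transform=None):
        self.paths = [os.path.join(root, f) for f in os.listdir(root)]
        self.transform = transform or transforms.Compose(
            [transforms.Resize((224, 224)), transforms.ToTensor()]
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img)

loader = DataLoader(
    AlluxioImageDataset(),
    batch_size=256,
    num_workers=8,      # parallel workers hide remaining I/O latency
    pin_memory=True,    # faster host-to-GPU copies
    prefetch_factor=4,  # keep batches queued ahead of the GPU
)
```

The point of the cache layer is that only the first epoch pays the S3 read cost; the DataLoader knobs (workers, pinned memory, prefetching) then cover whatever latency remains.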
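For the TensorBoard bullet, the sketch below shows the general technique rather than the speaker's exact workflow: torch.profiler writes traces that TensorBoard's PyTorch Profiler plugin can display, making DataLoader waits visible as gaps in GPU activity. The model and batch here are placeholders, and a CUDA device is assumed.

```python
# Sketch only: profiles a few training steps and emits a TensorBoard trace.
import torch
from torch.profiler import (
    profile, schedule, tensorboard_trace_handler, ProfilerActivity
)

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./logs/profiler"),
) as prof:
    for step in range(8):
        x = torch.randn(256, 1024, device="cuda")  # stands in for a real batch
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        prof.step()  # advance the profiling schedule each iteration

# View with: tensorboard --logdir ./logs/profiler  (PyTorch Profiler tab)
```

In the resulting timeline, long stretches where the GPU stream is idle while CPU threads sit in DataLoader calls are the I/O stalls this session targets.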
No previous knowledge expected
Lu Qiu is a machine learning engineer at Alluxio and a PMC member of the open-source Alluxio project. Lu develops big data solutions for AI/ML training; previously, Lu was responsible for core Alluxio components including leader election, journal management, and metrics management. Lu holds an M.S. in Data Science from George Washington University.