Abstract
Serverless computing has become a compelling cloud paradigm for model inference due to its high usability and elasticity. However, current serverless platforms suffer from significant cold-start overhead, especially for large models, which limits their ability to deliver low-latency, resource-efficient inference. In this talk, I will present three systems we built for scalable serverless inference. First, Torpor introduces node-level GPU pooling to enable fine-grained GPU sharing and fast model swapping. Second, LambdaScale leverages high-speed interconnects to scale models across nodes and performs pipelined inference for lower latency. Third, for emerging large mixture-of-experts (MoE) models, we design fine-grained expert scheduling with elastic scaling to improve the cost-effectiveness of MoE inference.
About the speaker
Minchen Yu is an Assistant Professor at the School of Data Science, The Chinese University of Hong Kong, Shenzhen. He received his Ph.D. from the Hong Kong University of Science and Technology. His research interests cover cloud computing and distributed systems, with a recent focus on serverless computing and machine learning systems. His work has been published at prestigious venues such as NSDI, ATC, EuroSys, INFOCOM, and SoCC, and has been adopted by leading cloud platforms such as Alibaba Cloud. He received the Best Paper Runner-Up Award at IEEE ICDCS 2021.
