Abstract
Large language models (LLMs) are profoundly reshaping global economies and technology. Efficient systems for LLM pre-training are essential because they directly impact model quality, operational costs, and environmental sustainability. In this talk, Zhuang will present two system research projects that tackle fundamental communication challenges in LLM pre-training. ZEN (OSDI ’25) addresses the data plane by optimizing synchronization strategies for sparse tensor communication. GEMINI (SOSP ’23) focuses on the management plane, redesigning the checkpoint storage system to minimize failure recovery overhead.
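As a rough illustration of the data-plane problem ZEN targets, the Python sketch below simulates synchronizing sparse gradients across data-parallel workers: each worker transmits only its largest-magnitude entries, and the partial updates are combined. This is not ZEN's algorithm; the top-k density, the dense accumulation, and the simulated allgather are assumptions made purely for exposition.

# Illustrative sketch only: toy top-k sparsification plus an allgather-style
# exchange for data-parallel gradient synchronization. NOT ZEN's actual
# algorithm; density ratio and the simulated allgather are assumptions.
import numpy as np

def sparsify_topk(grad: np.ndarray, density: float = 0.01):
    """Keep only the largest-magnitude entries; return (indices, values)."""
    k = max(1, int(grad.size * density))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of top-k magnitudes
    return idx, flat[idx]

def synchronize(worker_grads, density: float = 0.01) -> np.ndarray:
    """Simulate an allgather of sparse (index, value) pairs, then average.

    Untransmitted entries are treated as zero; the resulting accuracy vs.
    traffic trade-off is what sparse synchronization strategies must manage.
    """
    accum = np.zeros(worker_grads[0].size)
    for g in worker_grads:
        idx, vals = sparsify_topk(g, density)  # each worker compresses locally
        accum[idx] += vals                     # gathered sparse updates are summed
    return (accum / len(worker_grads)).reshape(worker_grads[0].shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal((256, 256)) for _ in range(4)]  # 4 workers
    avg = synchronize(grads, density=0.01)
    dense_avg = sum(grads) / len(grads)
    print("relative error vs. dense averaging:",
          np.linalg.norm(avg - dense_avg) / np.linalg.norm(dense_avg))

Even this toy version hints at the core tension: transmitting 1% of each gradient cuts communication volume by roughly 100x but introduces approximation error, which is why the choice of synchronization strategy matters.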
About the speaker
Zhuang Wang is a Senior Applied Scientist at Amazon Annapurna Labs. He earned his Ph.D. in Computer Science from Rice University in 2023. His research focuses on efficient training and inference systems for large language models. He has published first-author papers at premier venues including OSDI, SOSP, SIGCOMM, and EuroSys, and has served on the program committees of OSDI, ATC, and MLSys.
