December 18, 2025
Title: Towards Consistent and Physically Plausible Visual Generation
Time: 11:00am
Venue: CB 308
Speaker(s): Prof. Jianfei Cai, Monash University (Departmental Seminar)
Remark(s): Abstract
Recent advances in large language models (LLMs) and multimodal large language models (MLLMs) have significantly enhanced the understanding and encoding of textual information. Leveraging these capabilities, a growing number of diffusion-based generative models have emerged for text-conditioned visual generation, spanning text-to-image, text-to-video, and text-to-3D tasks. While these models offer remarkable flexibility and produce increasingly realistic content, they still face fundamental challenges: aligning precisely with user intent; maintaining spatial, view, and temporal consistency; and adhering to the laws of physics. In this talk, I will present several recent research projects from my group that attack these challenges: PanFusion enforces global consistency in text-to-panorama image generation; MVSplat360 uses image conditions and an explicit 3D representation to enhance the view consistency of 3D generation; and VLIPP integrates physics-informed priors to ensure physically plausible text-to-video generation. I will conclude by pointing out remaining limitations and discussing future directions, such as developing world models.
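As background for the abstract above, the following is a minimal, self-contained sketch of the text-conditioned diffusion sampling loop with classifier-free guidance that such models build on. It is an illustrative toy, not code from PanFusion, MVSplat360, or VLIPP; the `denoiser` stub, the simplified update rule, and all constants are hypothetical placeholders.

```python
# Toy sketch of text-conditioned diffusion sampling with
# classifier-free guidance (CFG), the generic mechanism behind
# text-to-image/video diffusion models. Placeholder network only.
import numpy as np

def denoiser(x, t, text_emb):
    # Placeholder for a trained noise-prediction network eps_theta(x, t, c).
    return np.zeros_like(x)

def sample(shape, text_emb, steps=50, guidance=7.5, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)               # start from pure noise
    null_emb = np.zeros_like(text_emb)           # "unconditional" embedding
    for t in reversed(range(steps)):
        eps_cond = denoiser(x, t, text_emb)      # text-conditioned prediction
        eps_uncond = denoiser(x, t, null_emb)    # unconditional prediction
        # CFG: extrapolate toward the text-conditioned direction.
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        x = x - eps / steps                      # heavily simplified update
        if t > 0:
            x = x + 0.01 * rng.standard_normal(shape)  # re-inject noise
    return x
```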
About the speaker
Jianfei Cai is a Professor at the Faculty of IT, Monash University, where he served as the inaugural Head of the Data Science & AI Department. Before that, he was Head of the Visual and Interactive Computing Division and of the Computer Communications Division at Nanyang Technological University (NTU). His major research interests include computer vision, deep learning, and multimedia. He is a co-recipient of paper awards at ACCV, ICCM, IEEE ICIP, and MMSP, and a winner of Monash FIT's Dean's Researcher of the Year Award and Monash FIT's Dean's Award for Excellence in Graduate Research Supervision. He serves or has served as an Associate Editor for TPAMI, IJCV, IEEE T-IP, T-MM, and T-CSVT, and as a Senior/Area Chair for CVPR, ICCV, ECCV, ACM Multimedia, ICLR, and IJCAI. He was the Chair of the IEEE CAS VSPC-TC from 2016 to 2018. He served as the lead TPC Chair for IEEE ICME 2012, as chair and co-chair of the best paper award committees for IEEE T-MM in 2020 and 2019, and as the lead General Chair for ACM Multimedia 2024. He is a Fellow of IEEE.

December 19, 2025
Title: Towards Scalable Serverless LLM Inference Systems
Time: 10:30am
Venue: CB 308
Speaker(s): Prof. Minchen Yu
Remark(s): Abstract
Serverless computing has become a compelling cloud paradigm for model inference due to its high usability and elasticity. However, current serverless platforms suffer from significant cold-start overhead, especially for large models, which limits their ability to deliver low-latency, resource-efficient inference. In this talk, I will present three systems we built for scalable serverless inference. First, Torpor introduces node-level GPU pooling that enables fine-grained GPU sharing and fast model swapping. Second, LambdaScale leverages high-speed interconnects to scale models across nodes and performs pipelined inference for lower latency. Third, for emerging large mixture-of-experts (MoE) models, we design fine-grained expert scheduling with elastic scaling to improve the cost-effectiveness of MoE inference.
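To make the pooling idea concrete, here is a toy sketch in the spirit of the node-level GPU pooling described above; it is an illustrative assumption, not Torpor's actual implementation. A shared pool keeps a bounded number of models resident on GPUs and swaps out the least recently used one when a new model must load, so warm requests skip the swap and avoid the cold start:

```python
# Toy sketch of node-level GPU pooling with LRU model swapping.
# Hypothetical illustration only, not the Torpor system's code.
from collections import OrderedDict

class GpuPool:
    def __init__(self, gpu_slots: int):
        self.gpu_slots = gpu_slots
        self.resident = OrderedDict()   # model_id -> weights, in LRU order

    def load(self, model_id, fetch_weights):
        if model_id in self.resident:             # warm start: already on GPU
            self.resident.move_to_end(model_id)
            return self.resident[model_id]
        if len(self.resident) >= self.gpu_slots:  # pool full: evict LRU model
            self.resident.popitem(last=False)
        weights = fetch_weights(model_id)         # swap in from host memory
        self.resident[model_id] = weights
        return weights

# Example usage: with two GPU slots, the third request is a warm hit.
pool = GpuPool(gpu_slots=2)
pool.load("model-a", lambda m: f"<weights:{m}>")
pool.load("model-b", lambda m: f"<weights:{m}>")
pool.load("model-a", lambda m: f"<weights:{m}>")  # no swap needed
```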
About the speaker
Minchen Yu is an Assistant Professor at the School of Data Science, The Chinese University of Hong Kong, Shenzhen. He received his Ph.D. from the Hong Kong University of Science and Technology. His research interests cover cloud computing and distributed systems, with a recent focus on serverless computing and machine learning systems. His work has been published at prestigious venues such as NSDI, ATC, EuroSys, INFOCOM, and SoCC, and has been adopted by leading cloud platforms such as Alibaba Cloud. He received the Best Paper Runner-Up Award at IEEE ICDCS 2021.

Title: LLM-Based Zero-Shot Speech Synthesis
Time: 2:00pm
Venue: CB 308
Speaker(s): Dr. Shujie Liu
Remark(s): Abstract
With the rapid development of large language models (LLMs) in natural language processing, speech LLMs have also begun to receive increasing attention. In this talk, we will introduce VALL-E, a zero-shot text-to-speech (TTS) synthesis approach built upon large language models. Leveraging the in-context learning capabilities of LLMs, VALL-E can generate high-quality, personalized speech using only a three-second audio prompt from an unseen speaker. Building upon this foundation, we will further introduce several extensions of VALL-E, including VALL-E X (the multilingual version), VALL-E 2 (addressing stability issues), PALLE (combining AR and NAR modelling), and MELLE and FELLE (based on continuous speech representations).
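To illustrate the prompting scheme behind this zero-shot capability (a sketch under assumed interfaces, not the actual VALL-E code): the model is a language model over discrete audio-codec tokens, and voice cloning amounts to asking it to continue the enrollment clip's codec tokens, conditioned on the phonemes of both the enrollment transcript and the target text. The `codec_lm` stub and all token values below are hypothetical placeholders:

```python
# Toy sketch of VALL-E-style zero-shot TTS prompting. A language
# model over discrete codec tokens continues the acoustic tokens of
# a short enrollment clip, conditioned on a phoneme sequence.
import random

def codec_lm(context):
    # Placeholder for a trained AR codec language model: returns the
    # next acoustic token given the token context.
    return random.randrange(1024)

def zero_shot_tts(prompt_phonemes, prompt_codec, target_phonemes, max_len=200):
    # Condition on: phonemes of prompt + target, then the prompt's
    # acoustic tokens; the generated continuation is the target speech.
    context = prompt_phonemes + target_phonemes + prompt_codec
    generated = []
    for _ in range(max_len):
        token = codec_lm(context + generated)
        generated.append(token)
    return generated  # decoded to a waveform by a codec decoder in practice

# Example: clone a voice from a short enrollment clip.
speech_tokens = zero_shot_tts(
    prompt_phonemes=[12, 7, 33],     # phonemes of the enrollment transcript
    prompt_codec=[101, 502, 733],    # codec tokens of the three-second clip
    target_phonemes=[5, 18, 44, 2],  # phonemes of the text to synthesize
)
```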
About the speaker
Shujie Liu is a Principal Researcher at MSRA Hong Kong. His research focuses on natural language processing, speech processing, and machine learning. He has published over 100 papers in top-tier conferences and journals in NLP and speech, co‑authored the book Machine Translation, and contributed to Introduction to Artificial Intelligence. He has won multiple first‑place awards in international NLP and speech evaluation campaigns and has served as a reviewer and area chair for several major conferences. His research has been widely deployed in Microsoft products, including Microsoft Translator, Skype Translator, Microsoft IME, and Microsoft Speech Services.
