Abstract
With the rapid development of large language models (LLMs) in natural language processing, speech LLMs have begun to receive increasing attention. In this talk, we will introduce VALL‑E, a zero‑shot text‑to‑speech (TTS) synthesis approach built upon large language models. Leveraging the in‑context learning capabilities of LLMs, VALL‑E can generate high‑quality, personalized speech using only a three‑second audio prompt from an unseen speaker. Building on this foundation, we will then introduce several extensions of VALL‑E: VALL‑E X (the multilingual version), VALL‑E 2 (addressing stability issues), PALLE (combining AR and NAR modelling), and MELL‑E and FELLE (based on continuous speech representations).
About the speaker
Shujie Liu is a Principal Researcher at MSRA Hong Kong. His research focuses on natural language processing, speech processing, and machine learning. He has published over 100 papers in top-tier conferences and journals in NLP and speech, co‑authored the book Machine Translation, and contributed to Introduction to Artificial Intelligence. He has won multiple first‑place awards in international NLP and speech evaluation campaigns and has served as a reviewer and area chair for several major conferences. His research has been widely deployed in Microsoft products, including Microsoft Translator, Skype Translator, Microsoft IME, and Microsoft Speech Services.
