
MultiTalk: Enhancing 3D Talking Head Generation Across Languages
with Multilingual Video Dataset

Kim Sung-Bin, Chae-Yeon Lee, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh

Abstract

Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulation. However, lip-sync accuracy degrades when these models are applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task: generating 3D talking heads from speech in diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. Utilizing this dataset, we present a baseline model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance.

Index Terms: Speech-driven 3D talking head, Video dataset, Multilingual, Audio-visual speech recognition
Interspeech 2024
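
For illustration, the sketch below shows one possible way language-specific style embeddings could condition a speech-driven 3D talking head decoder, as described in the abstract. It is a minimal sketch under assumptions: the module and parameter names (`LanguageConditionedDecoder`, `num_languages`, `style_dim`, the FLAME-style vertex count) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class LanguageConditionedDecoder(nn.Module):
    """Minimal sketch: condition per-frame 3D vertex prediction on a
    learnable per-language style embedding (all names/dims are assumptions)."""

    def __init__(self, num_languages=20, style_dim=64,
                 audio_dim=768, vertex_dim=5023 * 3):
        super().__init__()
        # One learnable style vector per language in the dataset.
        self.style_embedding = nn.Embedding(num_languages, style_dim)
        # Maps concatenated audio features + style to per-frame vertex offsets.
        self.decoder = nn.Sequential(
            nn.Linear(audio_dim + style_dim, 512),
            nn.ReLU(),
            nn.Linear(512, vertex_dim),
        )

    def forward(self, audio_feats, language_id):
        # audio_feats: (batch, frames, audio_dim) speech features,
        #   e.g., from a pretrained speech encoder.
        # language_id: (batch,) integer index of the spoken language.
        style = self.style_embedding(language_id)               # (batch, style_dim)
        style = style.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        x = torch.cat([audio_feats, style], dim=-1)
        return self.decoder(x)                                  # per-frame vertex offsets


# Usage: 2 clips, 10 audio frames each, spoken in languages with indices 0 and 3.
model = LanguageConditionedDecoder()
audio = torch.randn(2, 10, 768)
langs = torch.tensor([0, 3])
offsets = model(audio, langs)   # shape: (2, 10, 5023 * 3)
```

The design choice illustrated here is simply that the language identity enters the model as a learned embedding concatenated with the audio features, so mouth-movement styles can differ per language while the audio encoder is shared.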