MultiTalk: Enhancing 3D Talking Head Generation Across Languages
with Multilingual Video Dataset
Kim Sung-Bin, Chae-Yeon Lee, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh
Abstract
Recent studies in speech-driven 3D talking head generation have achieved convincing results for verbal articulation. However, lip-sync accuracy degrades when the input speech is in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task of generating 3D talking heads from speech in diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. Utilizing this dataset, we present a baseline model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we propose a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model on our proposed dataset significantly enhances its multilingual performance.
Index Terms: Speech-driven 3D talking head, Video dataset,
Multilingual, Audio-visual speech recognition
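For illustration only, the listing below is a minimal sketch of how a learned per-language style embedding might condition a speech-driven 3D face decoder, as the abstract describes. The module names, dimensions, the GRU decoder, and the LANGUAGES list are hypothetical assumptions, not the authors' implementation.

    # Hypothetical sketch: conditioning a speech-driven 3D talking head decoder
    # on a learned per-language style embedding. All names and dimensions are
    # illustrative assumptions, not the paper's actual architecture.
    import torch
    import torch.nn as nn

    LANGUAGES = ["en", "es", "fr", "de", "ko", "ja"]  # placeholder subset of the 20 languages

    class MultilingualTalkingHead(nn.Module):
        def __init__(self, audio_dim=768, style_dim=64, vertex_dim=5023 * 3):
            super().__init__()
            # One learnable style vector per language.
            self.style_embedding = nn.Embedding(len(LANGUAGES), style_dim)
            # Decoder maps audio features plus language style to per-frame vertex offsets.
            self.decoder = nn.GRU(audio_dim + style_dim, 256, batch_first=True)
            self.vertex_head = nn.Linear(256, vertex_dim)

        def forward(self, audio_feats, lang_id):
            # audio_feats: (batch, frames, audio_dim) from a pretrained speech encoder
            # lang_id: (batch,) integer index into LANGUAGES
            style = self.style_embedding(lang_id)                    # (batch, style_dim)
            style = style.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
            hidden, _ = self.decoder(torch.cat([audio_feats, style], dim=-1))
            return self.vertex_head(hidden)                          # per-frame vertex offsets

    model = MultilingualTalkingHead()
    dummy_audio = torch.randn(2, 100, 768)                # e.g., frame-level speech features
    offsets = model(dummy_audio, torch.tensor([0, 4]))
    print(offsets.shape)                                  # torch.Size([2, 100, 15069])

The key design point the sketch conveys is that the language identity enters as a per-frame conditioning signal concatenated with the audio features, letting the decoder adapt mouth movements to language-specific articulation.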