CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

Abstract

With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modeling the multiple sequences. To mitigate these issues, we present CLaM-TTS, which employs a probabilistic residual vector quantization to 1) achieve superior compression in the token length, and 2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art zero-shot TTS baselines regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performance.

ICLR 2024
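
The abstract builds on residual vector quantization, where each stage quantizes the residual left over by the previous stage so that a single frame is represented by a stack of codebook indices. The sketch below illustrates only this plain, non-probabilistic baseline idea, not the paper's probabilistic variant; the codebook sizes, dimensions, and random codebooks are illustrative assumptions.

```python
# Minimal sketch of plain residual vector quantization (RVQ).
# This is background for the abstract, not CLaM-TTS's probabilistic method.
import numpy as np

rng = np.random.default_rng(0)
num_stages, codebook_size, dim = 4, 256, 64
# Random stand-in codebooks; in practice these are learned.
codebooks = rng.normal(size=(num_stages, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Return one code index per stage for a single frame x of shape (dim,)."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # Pick the codeword closest to the current residual.
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]  # the next stage models what is left over
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the frame by summing the selected codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

frame = rng.normal(size=dim)
codes = rvq_encode(frame, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes, float(np.mean((frame - recon) ** 2)))
```

Because every frame maps to `num_stages` parallel indices, a language model over such codes must either flatten the streams (longer sequences) or model them in a cascade; the abstract's claim is that emitting all stages' tokens at once avoids that cascaded modeling.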