Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

01/05/2023
by Chengyi Wang, et al.

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than that used by existing systems. VALL-E demonstrates in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
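The core idea in the abstract, casting TTS as conditional language modeling over discrete codec tokens, can be sketched in a few lines. The sketch below is purely illustrative: `toy_model` is a hypothetical stand-in for the paper's transformer (here just a deterministic pseudo-random categorical over a toy codebook), not VALL-E's actual architecture. The shape of the loop is the point: phoneme ids and the enrolled prompt's codec codes form the conditioning context, and the model autoregressively emits new codec tokens that a codec decoder (omitted) would turn into a waveform.

```python
import random

CODEBOOK_SIZE = 1024   # typical codebook size for a neural audio codec
EOS = CODEBOOK_SIZE    # special end-of-sequence token id (assumption)

def toy_model(context):
    """Stand-in for p(next codec token | phonemes, prompt, history).
    A real system would run a transformer here; this just draws a
    deterministic pseudo-random token so the sketch is runnable."""
    rng = random.Random(hash(tuple(context)) & 0xFFFFFFFF)
    return rng.randrange(CODEBOOK_SIZE + 1)  # may emit EOS

def synthesize(phoneme_ids, prompt_codes, max_len=50):
    """Autoregressively continue the acoustic prompt's codec codes,
    conditioned on the target phonemes. The returned token sequence
    would then be decoded to audio by the codec (not shown)."""
    context = list(phoneme_ids) + list(prompt_codes)
    out = []
    for _ in range(max_len):
        tok = toy_model(context + out)
        if tok == EOS:
            break
        out.append(tok)
    return out

codes = synthesize(phoneme_ids=[3, 17, 42], prompt_codes=[5, 9, 800])
print(len(codes), "codec tokens generated")
```

Because the prompt codes sit in the conditioning context, speaker identity (and, as the abstract notes, emotion and acoustic environment) is carried implicitly; no explicit speaker embedding is needed.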


Related research

09/21/2023  Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
06/25/2022  Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations
06/28/2022  RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion
07/14/2023  Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts
06/06/2023  Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
08/09/2020  Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions
10/27/2022  What Language Model to Train if You Have One Million GPU Hours?
