AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

by Yuzi Yan, et al.

While recent text to speech (TTS) models perform very well in synthesizing reading-style (e.g., audiobook) speech, synthesizing spontaneous-style speech (e.g., podcast or conversation) remains challenging for two main reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty of modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech. Specifically, 1) to insert filled pauses (FP) in the text sequence appropriately, we introduce an FP predictor to the TTS model; 2) to model the varying rhythms, we introduce a duration predictor based on mixture of experts (MoE), which contains three experts responsible for the generation of fast, medium and slow speech respectively, and fine-tune it as well as the pitch predictor for rhythm adaptation; 3) to adapt to other speakers' timbre, we fine-tune some parameters in the decoder with a small amount of speech data. To address the lack of training data, we mine a spontaneous speech dataset to support this work and facilitate future research on spontaneous TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
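
To make the mixture-of-experts idea concrete, the following is a minimal sketch of how three duration experts (fast, medium, slow) could be combined through a soft gate. It is an illustration only and not the AdaSpeech 3 implementation: the abstract does not specify the gating mechanism, so the softmax gate, the linear gate scores, the toy experts, and all names (`softmax`, `moe_duration`, `gate_weights`) are assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[Sequence[float]], float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_duration(
    features: Sequence[float],
    experts: Sequence[Expert],
    gate_weights: Sequence[Sequence[float]],
) -> Tuple[float, List[float]]:
    """Combine expert duration predictions with a soft gate.

    Assumption: the gate is one linear score per expert followed by a
    softmax. The paper abstract only states that there are three experts.
    """
    # One linear gate score per expert.
    scores = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    probs = softmax(scores)
    # Each expert predicts a duration (in frames) for the same input.
    preds = [expert(features) for expert in experts]
    duration = sum(p * d for p, d in zip(probs, preds))
    return duration, probs


# Hypothetical toy experts: fixed durations standing in for learned networks.
fast: Expert = lambda feats: 4.0
medium: Expert = lambda feats: 6.0
slow: Expert = lambda feats: 8.0

phoneme_features = [0.5, -1.0]
# All-zero gate weights give a uniform gate, so the result is the plain average.
uniform_gate = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
duration, probs = moe_duration(phoneme_features, [fast, medium, slow], uniform_gate)
```

In a trained model the experts and the gate would be learned networks, and the gate would shift weight toward the expert that matches the local speaking rate. The constant experts here only show how the weighted combination is formed.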




AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

Text to speech (TTS) is widely used to synthesize personal voice for a t...

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

With rapid progress in neural text-to-speech (TTS) models, personalized ...

ComedicSpeech: Text To Speech For Stand-up Comedies in Low-Resource Scenarios

Text to Speech (TTS) models can generate natural and high-quality speech...

ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech

Expressive text-to-speech (TTS) has become a hot research topic recently...

AdaSpeech: Adaptive Text to Speech for Custom Voice

Custom voice, a specific text to speech (TTS) service in commercial spee...

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Text-to-Speech (TTS) has recently seen great progress in synthesizing hi...

Controllable Context-aware Conversational Speech Synthesis

In spoken conversations, spontaneous behaviors like filled pause and pro...
