A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

02/08/2023
by Li-Wei Chen, et al.

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe a mismatch between training and inference alignments in mel-spectrogram-based autoregressive models, which leads to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system, whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
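
To make the phrase "discrete codes within multiple code groups" concrete, the following is a minimal sketch of multi-group vector quantization. It is an illustration under stated assumptions, not the MQTTS implementation: the codebooks here are random rather than learned, and the function names (`nearest_code`, `quantize_multi_group`) and sizes (`N_GROUPS`, `CODES_PER_GROUP`, `CHUNK_DIM`) are hypothetical choices made for this example.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize_multi_group(frame, codebooks):
    """Split one feature frame into equal chunks and quantize each chunk
    against its own codebook, yielding one discrete code per group."""
    n_groups = len(codebooks)
    if len(frame) % n_groups != 0:
        raise ValueError("frame length must be divisible by the number of groups")
    chunk = len(frame) // n_groups
    indices, reconstruction = [], []
    for g, codebook in enumerate(codebooks):
        sub = frame[g * chunk:(g + 1) * chunk]
        idx = nearest_code(sub, codebook)
        indices.append(idx)
        reconstruction.extend(codebook[idx])
    return indices, reconstruction


# Hypothetical sizes for illustration only; random codebooks stand in
# for codebooks that a real system would learn from data.
rng = random.Random(0)
N_GROUPS, CODES_PER_GROUP, CHUNK_DIM = 4, 8, 2
codebooks = [
    [[rng.uniform(-1, 1) for _ in range(CHUNK_DIM)] for _ in range(CODES_PER_GROUP)]
    for _ in range(N_GROUPS)
]
frame = [rng.uniform(-1, 1) for _ in range(N_GROUPS * CHUNK_DIM)]
indices, reconstruction = quantize_multi_group(frame, codebooks)
print(indices)  # one code index per group
```

With these illustrative sizes, each frame is described by 4 indices drawn from codebooks of 8 entries each, so the groups jointly cover 8^4 = 4096 combinations while each codebook stays small.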


research
09/11/2021

Incorporating Real-world Noisy Speech in Neural-network-based Speech Enhancement Systems

Supervised speech enhancement relies on parallel databases of degraded s...
research
12/07/2020

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture

In this work, we address the Text-to-Speech (TTS) task by proposing a no...
research
10/26/2022

Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection

This paper proposes a method for selecting training data for text-to-spe...
research
05/31/2023

Text-to-Speech Pipeline for Swiss German – A comparison

In this work, we studied the synthesis of Swiss German speech using diff...
research
07/04/2022

Mix and Match: An Empirical Study on Training Corpus Composition for Polyglot Text-To-Speech (TTS)

Training multilingual Neural Text-To-Speech (NTTS) models using only mon...
research
12/02/2019

Dynamic Prosody Generation for Speech Synthesis using Linguistics-Driven Acoustic Embedding Selection

Recent advances in Text-to-Speech (TTS) have improved quality and natura...
research
04/28/2022

Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss

Recent deep learning Text-to-Speech (TTS) systems have achieved impressi...
