VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

by   Chenpeng Du, et al.

The mainstream neural text-to-speech(TTS) pipeline is a cascade system, including an acoustic model(AM) that predicts acoustic feature from the input transcript and a vocoder that generates waveform according to the given acoustic feature. However, the acoustic feature in current TTS systems is typically mel-spectrogram, which is highly correlated along both time and frequency axes in a complicated way, leading to a great difficulty for the AM to predict. Although high-fidelity audio can be generated by recent neural vocoders from ground-truth(GT) mel-spectrogram, the gap between the GT and the predicted mel-spectrogram from AM degrades the performance of the entire TTS system. In this work, we propose VQTTS, consisting of an AM txt2vec and a vocoder vec2wav, which uses self-supervised vector-quantized(VQ) acoustic feature rather than mel-spectrogram. We redesign both the AM and the vocoder accordingly. In particular, txt2vec basically becomes a classification model instead of a traditional regression model while vec2wav uses an additional feature encoder before HifiGAN generator for smoothing the discontinuous quantized feature. Our experiments show that vec2wav achieves better reconstruction performance than HifiGAN when using self-supervised VQ acoustic feature. Moreover, our entire TTS system VQTTS achieves state-of-the-art performance in terms of naturalness among all current publicly available TTS systems.


page 1

page 2

page 3

page 4


Glow-WaveGAN: Learning Speech Representations from GAN-based Variational Auto-Encoder For High Fidelity Flow-based Speech Synthesis

Current two-stage TTS framework typically integrates an acoustic model w...

WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss

Tacotron-based text-to-speech (TTS) systems directly synthesize speech f...

High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

The goal of Automatic Voice Over (AVO) is to generate speech in sync wit...

SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

In recent Text-to-Speech (TTS) systems, a neural vocoder often generates...

SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling

We present a self-supervised speech restoration method without paired sp...

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Current text to speech (TTS) systems usually leverage a cascaded acousti...

Mispronunciation Detection and Correction via Discrete Acoustic Units

Computer-Assisted Pronunciation Training (CAPT) plays an important role ...

Please sign up or login with your details

Forgot password? Click here to reset