Using multiple reference audios and style embedding constraints for speech synthesis

10/09/2021
by Cheng Gong, et al.

End-to-end speech synthesis models can directly take an utterance as reference audio and generate speech from text with prosody and speaker characteristics similar to the reference audio. However, an appropriate acoustic embedding must be manually selected during inference. Moreover, because only matched text and speech are used during training, performing inference with unmatched text and speech causes the model to synthesize speech with low content quality. In this study, we propose to mitigate these two problems by using multiple reference audios and style embedding constraints rather than only the target audio. Multiple reference audios are selected automatically using sentence similarity determined by Bidirectional Encoder Representations from Transformers (BERT). In addition, we use a "target" style embedding from a pre-trained encoder as a constraint by considering the mutual information between the predicted and "target" style embeddings. Experimental results show that the proposed model improves speech naturalness and content quality with multiple reference audios and also outperforms the baseline model in ABX preference tests of style similarity.
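The reference-selection step described above can be illustrated with a minimal sketch: given a BERT sentence embedding for the target text and precomputed embeddings for the transcripts of candidate reference audios, pick the k most similar candidates by cosine similarity. The function name `select_references`, the choice of cosine similarity, and the toy random embeddings are assumptions for illustration only; the paper does not specify these details.

```python
# Hypothetical sketch: choose reference audios whose transcript embeddings
# (e.g. BERT sentence vectors, computed elsewhere) are most similar to the
# target text embedding. Embedding values below are toy placeholders.
import numpy as np

def select_references(target_emb, corpus_embs, k=3):
    """Return indices of the k corpus sentences most similar to the target,
    ranked by cosine similarity (highest first)."""
    target = target_emb / np.linalg.norm(target_emb)
    corpus = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = corpus @ target          # cosine similarity per candidate
    return np.argsort(sims)[::-1][:k]

# Toy usage with random stand-in embeddings
rng = np.random.default_rng(0)
target = rng.normal(size=8)
corpus = rng.normal(size=(10, 8))
print(select_references(target, corpus, k=3))
```

In practice the sentence embeddings would come from a BERT encoder, and the selected transcripts' audio clips would serve as the multiple reference audios fed to the synthesis model.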

Related research:

- Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (03/24/2018): We present an extension to the Tacotron speech synthesis architecture th...
- Unsupervised Style and Content Separation by Minimizing Mutual Information for Speech Synthesis (03/09/2020): We present a method to generate speech from input text and a style vecto...
- Using previous acoustic context to improve Text-to-Speech synthesis (12/07/2020): Many speech synthesis datasets, especially those derived from audiobooks...
- Into-TTS: Intonation Template based Prosody Control System (04/04/2022): Intonations take an important role in delivering the intention of the sp...
- Information Sieve: Content Leakage Reduction in End-to-End Prosody For Expressive Speech Synthesis (08/04/2021): Expressive neural text-to-speech (TTS) systems incorporate a style encod...
- Speech BERT Embedding For Improving Prosody in Neural TTS (06/08/2021): This paper presents a speech BERT model to extract embedded prosody info...
- Maximizing Mutual Information for Tacotron (08/30/2019): End-to-end speech synthesis method such as Tacotron, Tacotron2 and Trans...
