Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

03/24/2018
by RJ Skerry-Ryan, et al.

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.
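The conditioning idea in the abstract can be illustrated with a small sketch. The abstract does not describe the actual reference encoder, so everything here is an assumption made for illustration: the hypothetical `reference_embedding` mean-pools a reference spectrogram over time and applies a random linear projection, and the hypothetical `condition_on_prosody` broadcasts the resulting vector across text positions and concatenates it to stand-in text encoder states. All shapes are arbitrary.

```python
import numpy as np

def reference_embedding(ref_mel, proj):
    """Summarize a reference spectrogram (frames x mel bins) as one fixed-length vector.

    Assumption: mean-pool over time, then a tanh-squashed linear projection.
    The paper's real reference encoder is not specified in the abstract.
    """
    pooled = ref_mel.mean(axis=0)      # (n_mels,)
    return np.tanh(pooled @ proj)      # (embed_dim,)

def condition_on_prosody(text_states, prosody):
    """Broadcast the prosody vector across text positions and concatenate it."""
    tiled = np.tile(prosody, (text_states.shape[0], 1))   # (n_tokens, embed_dim)
    return np.concatenate([text_states, tiled], axis=1)   # (n_tokens, text_dim + embed_dim)

rng = np.random.default_rng(0)
ref_mel = rng.normal(size=(120, 80))      # hypothetical reference: 120 frames, 80 mel bins
proj = rng.normal(size=(80, 16)) * 0.1    # hypothetical projection weights (untrained)
text_states = rng.normal(size=(25, 256))  # hypothetical text encoder output: 25 tokens

prosody = reference_embedding(ref_mel, proj)
conditioned = condition_on_prosody(text_states, prosody)
print(conditioned.shape)  # (25, 272)
```

Because the embedding has a fixed length regardless of how many frames the reference contains, the same vector can be attached to text of any length, which is consistent with the abstract's claim that a reference embedding can be used to synthesize text different from the reference utterance.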

