Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

07/13/2022
by Yookyung Shin, et al.

Expressive text-to-speech has shown improved performance in recent years. However, style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion but may still want to control speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder that uses a pretrained language model to model the semantic relationship between text description embeddings and speech style embeddings. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. The experimental results show that our model can generate high-quality expressive speech even in unseen styles.

research
01/24/2022

Disentangling Style and Speaker Attributes for TTS Style Transfer

End-to-end neural TTS has shown improved performance in speech style tra...
research
03/15/2023

Cross-speaker Emotion Transfer by Manipulating Speech Style Latents

In recent years, emotional text-to-speech has shown considerable progres...
research
11/07/2021

Emotional Prosody Control for Speech Generation

Machine-generated speech is characterized by its limited or unnatural em...
research
11/19/2022

Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling

This paper aims to synthesize target speaker's speech with desired speak...
research
01/31/2023

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Expressive text-to-speech (TTS) aims to synthesize different speaking st...
research
06/08/2019

Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

Recent work has explored sequence-to-sequence latent variable models for...
research
05/17/2023

Using a Large Language Model to Control Speaking Style for Expressive TTS

Appropriate prosody is critical for successful spoken communication. Con...
