Controllable Context-aware Conversational Speech Synthesis

06/21/2021
by   Jian Cong, et al.

In spoken conversations, spontaneous behaviors such as filled pauses and prolongations occur frequently. Conversational partners also tend to align features of their speech with those of their interlocutor, a phenomenon known as entrainment. To produce human-like conversations, we propose a unified controllable spontaneous conversational speech synthesis framework that models both phenomena. Specifically, we use explicit labels to represent two typical spontaneous behaviors, filled pause and prolongation, in the acoustic model, and develop a neural-network-based predictor that predicts occurrences of the two behaviors from text. We then develop an algorithm based on the predictor to control the occurrence frequency of the behaviors, allowing the synthesized speech to vary from less to more disfluent. To model speech entrainment at the acoustic level, we use a context acoustic encoder to extract a global style embedding from the previous utterance, which conditions the synthesis of the current utterance. Furthermore, since the current and previous utterances belong to different speakers in a conversation, we add a domain adversarial training module that removes speaker-related information from the acoustic encoder output while preserving style-related information. Experiments show that the proposed approach synthesizes realistic conversations and controls the occurrence of spontaneous behaviors naturally.
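The abstract describes an algorithm that uses the predictor's outputs to control how often spontaneous behaviors appear. The paper does not give the algorithm's details here, but one plausible sketch is to rank token positions by predicted probability and keep only as many behavior insertions as a target rate allows. The function name `control_behaviors` and the rate-based top-k selection are illustrative assumptions, not the authors' actual method.

```python
def control_behaviors(probs, target_rate):
    """Sketch of frequency control over predicted spontaneous behaviors.

    probs: per-token probabilities (from a hypothetical text-based predictor)
           that a behavior such as a filled pause occurs at that position.
    target_rate: desired fraction of tokens carrying the behavior label;
                 0.0 yields fluent speech, higher values more disfluence.

    Returns a boolean mask selecting the top-k most probable positions,
    where k is derived from the target rate.
    """
    k = max(0, round(target_rate * len(probs)))
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(probs))]


# Example: predictor scores for a four-token utterance.
probs = [0.9, 0.1, 0.7, 0.3]
print(control_behaviors(probs, 0.5))  # two most probable positions kept
print(control_behaviors(probs, 0.0))  # no behaviors: fully fluent output
```

Varying `target_rate` is what lets the synthesized speech range from less disfluent to more disfluent, as the abstract describes.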


Related research

08/31/2023 · Towards Spontaneous Style Modeling with Semi-supervised Pre-training for Conversational Text-to-Speech Synthesis
The spontaneous behavior that often occurs in conversations makes speech...

06/11/2021 · Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis
For conversational text-to-speech (TTS) systems, it is vital that the sy...

04/23/2018 · A Discriminative Acoustic-Prosodic Approach for Measuring Local Entrainment
Acoustic-prosodic entrainment describes the tendency of humans to align ...

05/21/2020 · Conversational End-to-End TTS for Voice Agent
End-to-end neural TTS has achieved superior performance on reading style...

05/03/2023 · M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis
Conversational text-to-speech (TTS) aims to synthesize speech with prope...

04/23/2018 · Towards an Unsupervised Entrainment Distance in Conversational Speech using Deep Neural Networks
Entrainment is a known adaptation mechanism that causes interaction part...

07/06/2021 · AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style
While recent text to speech (TTS) models perform very well in synthesizi...
