Fine-grained style modelling and transfer in text-to-speech synthesis via content-style disentanglement

11/08/2020
by Tan Daxin, et al.

This paper presents a novel neural model for fine-grained style modeling and transfer in expressive text-to-speech (TTS) synthesis. By applying collaborative learning and adversarial learning strategies with thoughtfully designed loss functions, the proposed model performs effective phoneme-level disentanglement of the content and style factors of speech. Speech style transfer can be achieved by combining the style embedding extracted from a reference utterance with the phoneme embedding derived from the source text. Objective evaluation shows that the synthesized speech preserves the intended content and carries prosody similar to that of the reference speech. Subjective evaluation shows that the new model performs better than other fine-grained style transfer TTS models.
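
The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `combine_embeddings`, the use of concatenation, the toy vector sizes, and the assumption that the reference style sequence has already been aligned to one style vector per source phoneme are all hypothetical choices made for illustration.

```python
from typing import List

Vector = List[float]


def combine_embeddings(phoneme_emb: List[Vector], style_emb: List[Vector]) -> List[Vector]:
    """Pair each phoneme embedding with the style embedding at the same position.

    Assumption (not stated in the abstract): the two sequences are already
    aligned, and the pairing is done by simple concatenation.
    """
    if len(phoneme_emb) != len(style_emb):
        raise ValueError("expected exactly one style vector per phoneme")
    # List concatenation appends the style values after the content values.
    return [p + s for p, s in zip(phoneme_emb, style_emb)]


# Toy values: three source phonemes (2-dim content) and three
# phoneme-level style vectors (1-dim) taken from a reference utterance.
source_phonemes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reference_style = [[0.2], [0.9], [0.5]]

decoder_input = combine_embeddings(source_phonemes, reference_style)
```

In this sketch the content part of each vector comes only from the source text and the style part only from the reference, which mirrors the idea of disentangled factors; how a real model aligns a reference of different length to the source phonemes is left open here.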

