Fine-grained style modelling and transfer in text-to-speech synthesis via content-style disentanglement

11/08/2020

This paper presents a novel neural model for fine-grained style modeling and transfer in expressive text-to-speech (TTS) synthesis. By applying collaborative learning and adversarial learning strategies with carefully designed loss functions, the proposed model performs effective phoneme-level disentanglement of the content and style factors of speech. Speech style transfer is achieved by combining the style embedding extracted from a reference utterance with the phoneme embedding derived from the source text. Objective evaluation shows that the synthesized speech preserves the intended content and carries prosody similar to that of the reference speech. Subjective evaluation shows that the new model performs better than other fine-grained style transfer TTS models.
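The transfer step described above, pairing a phoneme-level style embedding from a reference utterance with the phoneme embedding of the source text, can be illustrated with a minimal sketch. The abstract does not say how the two embeddings are combined, so the concatenation below is an assumption, as are the function name `combine_embeddings`, the equal-length requirement, and the toy values; none of them are taken from the paper.

```python
from typing import List

Vector = List[float]


def combine_embeddings(phoneme_emb: List[Vector], style_emb: List[Vector]) -> List[Vector]:
    """Join each phoneme's content vector with its style vector.

    Concatenation is an assumed fusion rule for illustration only;
    the abstract does not specify how the embeddings are combined.
    """
    if len(phoneme_emb) != len(style_emb):
        # Assumption: exactly one style vector per source phoneme.
        raise ValueError("expected one style vector per phoneme")
    return [content + style for content, style in zip(phoneme_emb, style_emb)]


# Toy values (hypothetical): 3 phonemes, 2-dim content, 1-dim style.
source_phonemes = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # derived from the source text
reference_style = [[1.0], [0.0], [-1.0]]                # extracted from the reference utterance

# The combined sequence would be fed to a decoder that generates speech.
decoder_input = combine_embeddings(source_phonemes, reference_style)
```

In this sketch the content vectors are left unchanged, so swapping in a different `reference_style` changes only the style part of each combined vector, which is the property that content-style disentanglement is meant to make possible.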
