UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

01/10/2023
by   Haogeng Liu, et al.
0

Text-to-speech (TTS) and voice conversion (VC) are two different tasks both aiming at generating high quality speaking voice according to different input modality. Due to their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, prosody information. Both TTS and VC can be regarded as mining these three parts of information from the input and completing the reconstruction of speech. For TTS, the speech content information is derived from the text, while in VC it's derived from the source speech, so all the remaining units are shared except for the speech content extraction module in the two tasks. We applied vector quantization and domain constrain to bridge the gap between the content domains of TTS and VC. Objective and subjective evaluation shows that by combining the two task, TTS obtains better speaker modeling ability while VC gets hold of impressive speech content decoupling capability.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/06/2021

SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines

Nowadays, as more and more systems achieve good performance in tradition...
research
08/21/2023

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Voice conversion as the style transfer task applied to speech, refers to...
research
02/21/2022

AVQVC: One-shot Voice Conversion by Vector Quantization with applying contrastive learning

Voice Conversion(VC) refers to changing the timbre of a speech while ret...
research
08/08/2022

TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training

Non-parallel many-to-many voice conversion remains an interesting but ch...
research
04/23/2020

Unsupervised Speech Decomposition via Triple Information Bottleneck

Speech information can be roughly decomposed into four components: langu...
research
09/12/2021

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Given a piece of speech and its transcript text, text-based speech editi...
research
02/18/2022

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Though significant progress has been made for speaker-dependent Video-to...

Please sign up or login with your details

Forgot password? Click here to reset