Self supervised learning for robust voice cloning

04/07/2022
by   Konstantinos Klapsas, et al.
0

Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as the robustness to the acoustic conditions of the target utterance.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/23/2020

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

We present a novel approach to any-to-one (A2O) voice conversion (VC) in...
research
12/08/2021

Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features

Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker ...
research
10/23/2022

Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

Single channel target speaker separation (TSS) aims at extracting a spea...
research
07/05/2022

Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

The zero-shot scenario for speech generation aims at synthesizing a nove...
research
03/31/2022

WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

Recent advances in neural text-to-speech research have been dominated by...
research
03/01/2021

AdaSpeech: Adaptive Text to Speech for Custom Voice

Custom voice, a specific text to speech (TTS) service in commercial spee...
research
03/18/2022

AdaVocoder: Adaptive Vocoder for Custom Voice

Custom voice is to construct a personal speech synthesis system by adapt...

Please sign up or login with your details

Forgot password? Click here to reset