Sample Efficient Adaptive Text-to-Speech

09/27/2018
by   Yutian Chen, et al.

We present a meta-learning approach for adaptive text-to-speech (TTS) with little data. During training, we learn a multi-speaker model using a shared conditional WaveNet core and independent learned embeddings for each speaker. The aim of training is not to produce a neural network with fixed weights that is then deployed as a TTS system. Instead, the aim is to produce a network that needs only a small amount of data at deployment time to adapt rapidly to new speakers. We introduce and benchmark three strategies: (i) learning the speaker embedding while keeping the WaveNet core fixed, (ii) fine-tuning the entire architecture with stochastic gradient descent, and (iii) predicting the speaker embedding with a trained neural network encoder. The experiments show that these approaches are successful at adapting the multi-speaker neural network to new speakers, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
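
The difference between strategies (i) and (ii) can be illustrated with a toy sketch. This is not the authors' implementation: the conditional WaveNet core is replaced here by a hypothetical two-parameter linear model, the speaker embedding is a single scalar, and plain full-batch gradient descent stands in for stochastic gradient descent. The names `predict`, `mse`, and `adapt`, the learning rate, and the data are all assumptions made for illustration. Strategy (iii), the trained encoder, is omitted.

```python
# Toy stand-in (assumption): the shared core is (w, v) and the model output
# is w * x + v * emb, where emb is a scalar speaker embedding.

def predict(core, emb, x):
    w, v = core
    return w * x + v * emb

def mse(core, emb, data):
    # Mean squared error over the adaptation data for one speaker.
    return sum((predict(core, emb, x) - t) ** 2 for x, t in data) / len(data)

def adapt(core, emb, data, lr=0.05, steps=500, tune_core=False):
    """Adapt to a new speaker by gradient descent on the squared error.

    tune_core=False: only the embedding is updated (like strategy (i)).
    tune_core=True:  core and embedding are updated (like strategy (ii)).
    """
    w, v = core
    n = len(data)
    for _ in range(steps):
        gw = gv = ge = 0.0
        for x, t in data:
            err = (w * x + v * emb) - t
            ge += err * v    # gradient w.r.t. the embedding
            gw += err * x    # gradient w.r.t. core weight w
            gv += err * emb  # gradient w.r.t. core weight v
        emb -= lr * ge / n
        if tune_core:
            w -= lr * gw / n
            v -= lr * gv / n
    return (w, v), emb

# Hypothetical "pre-trained" core and a few samples from an unseen speaker
# whose true slope (2.2) differs slightly from the core's (2.0).
pretrained_core = (2.0, 1.5)
new_speaker_data = [(x, 2.2 * x + 3.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

core_i, emb_i = adapt(pretrained_core, 0.0, new_speaker_data)
core_ii, emb_ii = adapt(pretrained_core, 0.0, new_speaker_data, tune_core=True)

loss_i = mse(core_i, emb_i, new_speaker_data)
loss_ii = mse(core_ii, emb_ii, new_speaker_data)
print("embedding only:", loss_i)
print("whole model:   ", loss_ii)
```

In this toy setting, adapting only the embedding leaves the core untouched but cannot correct the slope mismatch, while fine-tuning everything fits the new speaker more closely at the cost of changing the shared weights. This says nothing about how the strategies compare on real speech; that is what the paper's experiments address.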

research
11/07/2021

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

Personalizing a speech synthesis system is a highly desired application,...
research
10/12/2021

Adapting TTS models For New Speakers using Transfer Learning

Training neural text-to-speech (TTS) models for a new speaker typically ...
research
11/01/2022

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

Fine-tuning is a popular method for adapting text-to-speech (TTS) models...
research
02/20/2018

Fitting New Speakers Based on a Short Untranscribed Sample

Learning-based Text To Speech systems have the potential to generalize f...
research
10/28/2022

Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation

Adapting a neural text-to-speech (TTS) model to a target speaker typical...
research
12/13/2018

Modeling Multi-speaker Latent Space to Improve Neural TTS: Quick Enrolling New Speaker and Enhancing Premium Voice

Neural TTS has shown it can generate high quality synthesized speech. In...
research
11/04/2019

Supervised online diarization with sample mean loss for multi-domain data

Recently, a fully supervised speaker diarization approach was proposed (...
