Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation

Many neural text-to-speech architectures can synthesize near-natural speech from text inputs. These architectures must be trained on tens of hours of annotated, high-quality speech data, and compiling such large databases for every new voice takes considerable time and effort. In this paper, we describe a method that extends the popular Tacotron-2 architecture and its training with data augmentation to enable single-speaker synthesis from a limited amount of specific training data. In contrast to the elaborate augmentation methods proposed in the literature, we use simple stationary noises for data augmentation. Our extension is easy to implement and adds almost no computational overhead during training and inference. Using only two hours of training data, our approach was rated by human listeners as on par with the baseline Tacotron-2 trained on 23.5 hours of LJSpeech data. In addition, we evaluated our model with a semantically unpredictable sentences (SUS) test, which showed that both models reach similar intelligibility levels.
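To make the idea of augmenting with simple stationary noise concrete, here is a minimal sketch that mixes white Gaussian noise into a waveform at a chosen signal-to-noise ratio. The abstract does not say which noise types, levels, or mixing procedure the authors use, so the white-noise choice, the `snr_db` parameter, and the function name `add_stationary_noise` are all illustrative assumptions and not the paper's implementation.

```python
import math
import random


def add_stationary_noise(samples, snr_db, rng=None):
    """Return a copy of `samples` with white Gaussian noise mixed in.

    Hypothetical helper for illustration only: white noise and a fixed
    SNR in dB are assumptions, not details taken from the paper.
    `samples` must be a non-empty sequence of floats.
    """
    rng = rng or random.Random()
    # Mean power of the clean signal.
    signal_power = sum(s * s for s in samples) / len(samples)
    # Noise power that yields the requested signal-to-noise ratio.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise_std = math.sqrt(noise_power)
    # White Gaussian noise is stationary: its statistics do not change over time.
    return [s + rng.gauss(0.0, noise_std) for s in samples]


# Toy "utterance": one second of a 220 Hz sine sampled at 16 kHz.
sample_rate = 16000
clean = [
    0.5 * math.sin(2 * math.pi * 220 * n / sample_rate)
    for n in range(sample_rate)
]
noisy = add_stationary_noise(clean, snr_db=20.0, rng=random.Random(0))
```

In a training pipeline, a function like this would typically be applied to each utterance on the fly, so every epoch sees a different noise realization at negligible extra cost. Whether the authors do it this way is not stated in the abstract.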


Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation

Data augmentation via voice conversion (VC) has been successfully applie...

TTS-by-TTS: TTS-driven Data Augmentation for Fast and High-Quality Speech Synthesis

In this paper, we propose a text-to-speech (TTS)-driven data augmentatio...

Data Augmentation for Speech Recognition in Maltese: A Low-Resource Perspective

Developing speech technologies is a challenge for low-resource languages...

Exploring Voice Conversion based Data Augmentation in Text-Dependent Speaker Verification

In this paper, we focus on improving the performance of the text-depende...

Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks

Online social media is rife with offensive and hateful comments, prompti...

De-STT: De-entaglement of unwanted Nuisances and Biases in Speech to Text System using Adversarial Forgetting

Training a robust Speech to Text (STT) system requires tens of thousands...

Unit selection synthesis based data augmentation for fixed phrase speaker verification

Data augmentation is commonly used to help build a robust speaker verifi...
