Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis

06/03/2021
by Beata Lorincz, et al.

Building multispeaker neural network-based text-to-speech (TTS) synthesis systems commonly relies on the availability of large amounts of high-quality recordings from each speaker, and on conditioning the training process on the speaker's identity or on a learned representation of it. However, when little data is available from each speaker, or the number of speakers is limited, a multispeaker TTS system can be hard to train and tends to yield poor speaker similarity and naturalness. To address this issue, we explore two directions: forcing the network to learn a better speaker identity representation by adding a loss term, and augmenting the input data for each speaker using waveform manipulation methods. We show that both methods are effective when evaluated with both objective and subjective measures. The additional loss term improves speaker similarity, while the data augmentation improves the intelligibility of the multispeaker TTS system.
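The two directions named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss or of the waveform manipulations, so everything below is an assumption made for illustration: the speaker term is taken to be a cosine distance between speaker embeddings (as might be produced by a speaker verification model), it is added to a mean-squared spectrogram error with a hypothetical weight, and the augmentation shown is a simple speed perturbation by linear interpolation. The function names and the default weight are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_speaker_loss(emb_synth, emb_ref, eps=1e-8):
    """Assumed speaker term: 1 - cosine similarity of two embeddings."""
    num = float(np.dot(emb_synth, emb_ref))
    den = float(np.linalg.norm(emb_synth) * np.linalg.norm(emb_ref)) + eps
    return 1.0 - num / den


def total_loss(mel_pred, mel_target, emb_synth, emb_ref, weight=0.1):
    """Reconstruction error plus a weighted speaker term.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    recon = float(np.mean((mel_pred - mel_target) ** 2))
    return recon + weight * cosine_speaker_loss(emb_synth, emb_ref)


def speed_perturb(wave, factor):
    """Resample a 1-D waveform by linear interpolation.

    A factor above 1 shortens the signal (faster speech); a factor
    below 1 lengthens it. This is one possible waveform manipulation,
    chosen here only as an example.
    """
    n_out = int(round(len(wave) / factor))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)
```

In this sketch the speaker term is zero when the synthesized and reference embeddings point in the same direction, so it penalizes only a mismatch in speaker identity. The perturbation function yields additional training waveforms per speaker without new recordings.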


