AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines

10/22/2020
by   Yao Shi, et al.

In this paper, we present AISHELL-3, a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers. Auxiliary speaker attributes such as gender, age group, and native accent are explicitly annotated in the corpus, and transcripts are provided at both the Chinese character level and the pinyin level. We present a baseline system that uses AISHELL-3 for multi-speaker Mandarin speech synthesis. The multi-speaker synthesis system is an extension of Tacotron-2 in which a speaker verification model, together with a corresponding loss on voice similarity, is incorporated as a feedback constraint. We aim to use the presented corpus to build a robust synthesis model capable of zero-shot voice cloning; the system trained on this dataset also generalizes well to speakers never seen during training. Objective evaluation results from our experiments show that the proposed multi-speaker synthesis system achieves high voice similarity in terms of both speaker embedding similarity and equal error rate. The dataset, baseline system code, and generated samples are available online.
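The two objective measures mentioned above can be illustrated with a minimal sketch. The code below is not the paper's implementation; it assumes speaker embeddings are fixed-length vectors compared by cosine similarity (the function names and the simple threshold-sweep EER estimate are illustrative assumptions).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def speaker_feedback_loss(emb_synth, emb_ref):
    """Illustrative feedback-constraint term: penalize low cosine
    similarity between embeddings of synthesized and reference speech."""
    return 1.0 - cosine_similarity(emb_synth, emb_ref)

def equal_error_rate(scores, labels):
    """Estimate EER by sweeping a decision threshold over the scores.
    labels: 1 = same-speaker trial, 0 = different-speaker trial."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # false accept rate
        frr = np.mean(scores[labels == 1] < t)   # false reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return float(eer)
```

In a feedback-constrained training loop, a term like `speaker_feedback_loss` would be added to the synthesis loss, while `equal_error_rate` over same/different-speaker trial scores serves as an evaluation metric on held-out speakers.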


Related research

10/16/2020 · Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion
Recent state-of-the-art neural text-to-speech (TTS) synthesis models hav...

06/03/2021 · An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis
Multi-speaker spoken datasets enable the creation of text-to-speech synt...

11/19/2020 · TaL: a synchronised multi-speaker corpus of ultrasound tongue imaging, audio, and lip videos
We present the Tongue and Lips corpus (TaL), a multi-speaker corpus of a...

05/10/2020 · From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint
High-fidelity speech can be synthesized by end-to-end text-to-speech mod...

02/22/2022 · nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech
Multi-speaker text-to-speech (TTS) using a few adaption data is a challe...

04/24/2018 · Perceptual Evaluation of the Effectiveness of Voice Disguise by Age Modification
Voice disguise, purposeful modification of one's speaker identity with t...

11/10/2020 · Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis
We explore pretraining strategies including choice of base corpus with t...
