Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

01/30/2023
by   Takaaki Saeki, et al.

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. Then we train this model with paired data in a supervised manner, while freezing a language-aware embedding layer. This allows inference even for languages not included in the paired data but present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12%. The experiments use public datasets, and the implementation will be made available for reproducibility.
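
The intelligibility figure above is a character error rate (CER): the character-level edit distance between a reference text and a transcript, divided by the reference length. The abstract does not say how transcripts are obtained or how text is normalized, so the following is only a minimal sketch of the metric itself, not the authors' evaluation code. The function names are hypothetical, and the code assumes plain Python strings with no normalization.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER in percent: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

For instance, `character_error_rate("hello world", "hello word")` counts one deleted character against an 11-character reference, giving a CER of about 9.1%. Because insertions count as errors, CER can exceed 100% when the transcript is much longer than the reference.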

research
02/28/2022

Cross-Lingual Text Classification with Multilingual Distillation and Zero-Shot-Aware Training

Multilingual pre-trained language models (MPLMs) not only can handle tas...
research
10/27/2022

Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

This paper proposes Virtuoso, a massively multilingual speech-text joint...
research
06/29/2023

LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic ...
research
05/25/2023

Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration

This work aims to build a multilingual text-to-speech (TTS) synthesis sy...
research
03/05/2021

Multilingual Byte2Speech Text-To-Speech Models Are Few-shot Spoken Language Learners

We present a multilingual end-to-end Text-To-Speech framework that maps ...
research
05/24/2022

Adaptive multilingual speech recognition with pretrained models

Multilingual speech recognition with supervised learning has achieved gr...
research
06/03/2021

How to Adapt Your Pretrained Multilingual Model to 1600 Languages

Pretrained multilingual models (PMMs) enable zero-shot learning via cros...
