Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data

by Naoki Makishima, et al.

In this paper, we investigate semi-supervised joint training of text-to-speech (TTS) and automatic speech recognition (ASR), where a small amount of paired data and a large amount of unpaired text data are available. Conventional studies form a cycle called the TTS-ASR pipeline, in which a multispeaker TTS model synthesizes speech from text conditioned on a reference speech, an ASR model reconstructs the text from the synthesized speech, and both models are trained with a cycle-consistency loss. However, the synthesized speech does not reflect the speaker characteristics of the reference speech, and after training it becomes overly easy for the ASR model to recognize. This not only degrades the quality of the TTS model but also limits the improvement of the ASR model. To solve this problem, we propose improving cycle-consistency-based training with a speaker consistency loss and step-wise optimization. The speaker consistency loss brings the speaker characteristics of the synthesized speech closer to those of the reference speech. In the step-wise optimization, we first freeze the parameters of the TTS model before both models are trained jointly, to avoid over-adaptation of the TTS model to the ASR model. Experimental results demonstrate the efficacy of the proposed method.
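The abstract's training objective can be sketched as a cycle-consistency term (the ASR model must reconstruct the original text from synthesized speech) plus a speaker consistency term (the speaker embedding of the synthesized speech should match that of the reference speech). The following is a minimal numpy sketch under stated assumptions: the function names, the cosine-based form of the speaker loss, and the loss weight `lam` are illustrative, not taken from the paper, which does not specify these details in the abstract.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def speaker_consistency_loss(emb_ref, emb_syn):
    """Penalize mismatch between the speaker embedding of the
    reference speech and that of the synthesized speech.
    (Illustrative cosine-distance form; the paper's exact loss may differ.)"""
    return 1.0 - cosine_similarity(emb_ref, emb_syn)

def cycle_consistency_loss(asr_log_probs, target_ids):
    """Negative log-likelihood of the original text under the ASR
    output distribution for the synthesized speech.
    asr_log_probs: (T, V) per-step log-probabilities, target_ids: (T,)."""
    return -float(np.mean(asr_log_probs[np.arange(len(target_ids)), target_ids]))

def joint_loss(asr_log_probs, target_ids, emb_ref, emb_syn, lam=0.1):
    """Total semi-supervised objective: cycle loss + weighted speaker loss."""
    return (cycle_consistency_loss(asr_log_probs, target_ids)
            + lam * speaker_consistency_loss(emb_ref, emb_syn))

# Toy example with random embeddings and a uniform ASR distribution.
rng = np.random.default_rng(0)
emb_ref = rng.normal(size=64)
emb_syn = emb_ref + 0.1 * rng.normal(size=64)  # synthesized speech, close speaker
T, V = 5, 30
log_probs = np.full((T, V), np.log(1.0 / V))   # uniform output distribution
targets = rng.integers(0, V, size=T)

loss = joint_loss(log_probs, targets, emb_ref, emb_syn)
# Step-wise optimization (schematic): first update only the ASR parameters
# with the TTS parameters frozen, then unfreeze and update both models.
tts_frozen = True   # phase 1: gradients flow to the ASR model only
```

The step-wise schedule is shown only as a flag here; in practice it corresponds to excluding the TTS parameters from the optimizer during the first training phase.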


Cycle-consistency training for end-to-end speech recognition

This paper presents a method to train end-to-end automatic speech recogn...

Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

This paper proposes Virtuoso, a massively multilingual speech-text joint...

Semi-supervised acoustic model training for speech with code-switching

In the FAME! project, we aim to develop an automatic speech recognition ...

Semi-supervised acoustic modelling for five-lingual code-switched ASR using automatically-segmented soap opera speech

This paper considers the impact of automatic segmentation on the fully-a...

An ASR Guided Speech Intelligibility Measure for TTS Model Selection

The perceptual quality of neural text-to-speech (TTS) is highly dependen...

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Text-to-speech systems are typically evaluated on single sentences. When...

Improving Semi-supervised End-to-end Automatic Speech Recognition using CycleGAN and Inter-domain Losses

We propose a novel method that combines CycleGAN and inter-domain losses...
