Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

02/26/2022
by   Samuel Thomas, et al.

Compared to hybrid automatic speech recognition (ASR) systems, which use a modular architecture in which each component can be independently adapted to a new domain, recent end-to-end (E2E) ASR systems are harder to customize due to their all-neural monolithic construction. In this paper, we propose a novel text representation and training framework for E2E ASR models. With this approach, we show that a trained RNN Transducer (RNN-T) model's internal language model (LM) component can be effectively adapted with text-only data. An RNN-T model trained using both speech and text inputs improves over a baseline model trained on just speech, with close to 13% word error rate (WER) reduction on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation. The usefulness of the proposed approach is further demonstrated by customizing this general-purpose RNN-T model to three separate datasets. We observe 20-45% relative WER reduction in these settings with this novel LM-style customization technique, using only unpaired text data from the new domains.
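For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a relative WER reduction are commonly computed. It is not taken from the paper: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, whitespace tokenization is an assumption, and the numbers in the usage lines are made up for illustration rather than results reported by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (not from the paper).
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
reduction = relative_wer_reduction(baseline_wer=20.0, new_wer=15.0)
print(f"WER: {wer:.3f}, relative reduction: {reduction:.0%}")
```

A relative reduction is measured against the baseline's own error rate, so the same absolute improvement counts for more when the baseline WER is already low.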


