Multi-blank Transducers for Speech Recognition

11/04/2022
by Hainan Xu, et al.

This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90% [...] Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (<https://github.com/NVIDIA/NeMo>) toolkit.
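
To make the frame-skipping idea concrete, the following is a minimal sketch of greedy decoding in which each blank symbol carries a duration. It is not the authors' NeMo implementation: the function names, the `<b1>`/`<b2>`/`<b4>` symbol names, the choice of durations, and the toy predictor that stands in for the joint network are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def greedy_decode_multi_blank(
    num_frames: int,
    predict: Callable[[int, List[str]], str],
    blank_durations: Dict[str, int],
    max_symbols_per_frame: int = 10,
) -> Tuple[List[str], int]:
    """Greedy decoding where each blank symbol advances time by its duration.

    `predict(t, hyp)` is a hypothetical stand-in for the argmax of the joint
    network at frame `t` given the tokens emitted so far.
    Returns the emitted tokens and the number of `predict` calls made.
    """
    t = 0
    hyp: List[str] = []
    steps = 0
    emitted_at_t = 0
    while t < num_frames:
        symbol = predict(t, hyp)
        steps += 1
        if symbol in blank_durations:
            # A standard blank has duration 1; a big blank skips several frames.
            t += blank_durations[symbol]
            emitted_at_t = 0
        else:
            hyp.append(symbol)
            emitted_at_t += 1
            if emitted_at_t >= max_symbols_per_frame:
                # Safety cap so a single frame cannot emit tokens forever.
                t += 1
                emitted_at_t = 0
    return hyp, steps


def make_toy_predictor(
    token_at_frame: Dict[int, str], num_frames: int, durations: Sequence[int]
) -> Callable[[int, List[str]], str]:
    """Toy oracle (assumption): emits scripted tokens at fixed frames and
    otherwise the largest available blank that does not skip the next token."""
    frames = sorted(token_at_frame)

    def predict(t: int, hyp: List[str]) -> str:
        nxt = len(hyp)  # index of the next scripted token
        if nxt < len(frames) and frames[nxt] == t:
            return token_at_frame[t]
        end = frames[nxt] if nxt < len(frames) else num_frames
        gap = end - t
        best = max(d for d in durations if d <= gap)
        return f"<b{best}>"

    return predict


script = {2: "a", 7: "b"}  # tokens "a" and "b" occur at frames 2 and 7
standard = greedy_decode_multi_blank(
    12, make_toy_predictor(script, 12, [1]), {"<b1>": 1}
)
multi = greedy_decode_multi_blank(
    12,
    make_toy_predictor(script, 12, [1, 2, 4]),
    {"<b1>": 1, "<b2>": 2, "<b4>": 4},
)
print(standard, multi)
```

In this toy setup both runs recover the same two tokens, but the run with big blanks needs fewer predictor calls because long stretches without tokens are crossed in one step. The numbers here say nothing about real models; they only illustrate why emitting big blanks can reduce the number of decoding steps.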
