Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

09/09/2023, by Huaibo Zhao, et al.

Achieving high accuracy at low latency has always been a challenge for streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future context, a streaming ASR model achieves higher accuracy but incurs larger latency, which hurts streaming performance. In the Mask-CTC framework, an encoder network is trained to learn feature representations that anticipate long-term context, a property desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial for achieving low latency and high accuracy in triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated across model architectures, nor has it been verified that the encoder acquires the expected look-ahead capability to reduce latency. This study therefore examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as the Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing.
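At its core, the Mask-CTC framework trains a mask-predict decoder jointly with a CTC branch: ground-truth tokens are randomly replaced with a mask symbol, and the decoder must recover them from the surrounding (including future) context, which is what encourages the encoder to learn look-ahead representations. The following is a minimal sketch of just the token-masking step; the function name, the `<mask>` symbol, and the masking ratio are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical sketch of the masking step in Mask-CTC-style pre-training:
# ground-truth tokens are randomly replaced with a <mask> symbol, and a
# mask-predict decoder (trained jointly with a CTC branch on the encoder)
# learns to recover the masked tokens from bidirectional context.

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.4, rng=None):
    """Randomly mask tokens; only masked positions carry a prediction target."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)    # the decoder must predict this token
        else:
            masked.append(tok)
            targets.append(None)   # unmasked positions are excluded from the loss
    return masked, targets
```

In training, the mask-predict cross-entropy over the masked positions would be combined with the CTC loss on the encoder output; only the masking utility is sketched here.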

