Self-regularised Minimum Latency Training for Streaming Transformer-based Speech Recognition

04/24/2023
by   Mohan Li, et al.

This paper proposes a self-regularised minimum latency training (SR-MLT) method for streaming Transformer-based automatic speech recognition (ASR) systems. In previous works, latency was optimised by truncating the online attention weights based on hard alignments obtained from conventional ASR models, without taking into account the potential loss of ASR accuracy. In contrast, we present a strategy that obtains the alignments as part of model training, without external supervision. The alignments produced by the proposed method are dynamically regularised on the training data, so that the latency reduction does not come at the cost of ASR accuracy. SR-MLT is applied as a fine-tuning step on pre-trained Transformer models that use either the monotonic chunkwise attention (MoChA) or the cumulative attention (CA) algorithm for online decoding. ASR experiments on the AIShell-1 and Librispeech datasets show that, when applied to a decent pre-trained MoChA or CA baseline, SR-MLT effectively reduces latency, with relative gains ranging from 11.8% to ... Moreover, at certain accuracy levels, the models trained with SR-MLT achieve lower latency than those supervised using external hard alignments.
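The abstract does not give the training objective itself, but the core idea of trading latency against accuracy using the model's own attention weights as soft alignments can be sketched roughly as follows. The snippet below is a minimal, illustrative PyTorch sketch under that assumption; the function names (expected_latency, latency_regularised_loss) and the weighting factor lam are hypothetical and not taken from the paper.

```python
import torch

def expected_latency(attn, frame_times):
    """Expected emission time of each output token under the model's own
    (soft) alignment, i.e. its online attention weights."""
    # attn: (batch, num_tokens, num_frames), assumed normalised over frames.
    # frame_times: (num_frames,) time stamp (e.g. frame index or ms) of each
    #              encoder output.
    token_times = (attn * frame_times.view(1, 1, -1)).sum(dim=-1)
    return token_times.mean()

def latency_regularised_loss(asr_loss, attn, frame_times, lam=0.1):
    """Hypothetical combined objective: ASR term plus a latency penalty
    weighted by lam (an illustrative hyper-parameter, not from the paper)."""
    return asr_loss + lam * expected_latency(attn, frame_times)

# Toy usage with random tensors standing in for a real MoChA/CA model's outputs.
attn = torch.softmax(torch.randn(2, 5, 40), dim=-1)   # batch=2, 5 tokens, 40 frames
frame_times = torch.arange(40, dtype=torch.float32)    # frame indices as time stamps
loss = latency_regularised_loss(torch.tensor(1.5), attn, frame_times)
```

In the paper's setup, SR-MLT is applied as a fine-tuning step on a pre-trained MoChA or CA model; in a sketch like this, that would mean starting the optimisation from the pre-trained weights rather than training from scratch, so the latency term reshapes the alignments without sacrificing the recogniser's accuracy.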


