Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition

by Jingyu Sun, et al.

Conformer has shown great success in automatic speech recognition (ASR) on many public benchmarks. One of its crucial drawbacks is its quadratic time and space complexity with respect to the input sequence length, which prevents the model from scaling up and from processing longer input audio sequences. To solve this issue, numerous linear attention methods have been proposed. However, these methods often have limited performance on ASR because they treat all tokens equally, neglecting the fact that neighbouring tokens are often more strongly connected than distant ones. In this paper, we take this fact into account and propose a new locality-biased linear attention for Conformer. It not only achieves higher accuracy than the vanilla Conformer, but also has linear time and space complexity. Specifically, we replace the softmax attention in Conformer blocks with a locality-biased linear attention (LBLA) mechanism. The LBLA contains a kernel function to ensure linear complexity and a cosine reweighting matrix to place more weight on neighbouring tokens. Extensive experiments on the LibriSpeech corpus show that by introducing this locality bias to the Conformer, our method achieves a lower word error rate with more than 22
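The abstract does not give the LBLA formula, so the following is only a minimal sketch of how a kernel function and a cosine reweighting can be combined at linear cost. It assumes a ReLU kernel and a weight of cos(pi/2 * (i - j) / n) between positions i and j, in the style of cosFormer, which may differ from the paper's actual design. The function names, the `eps` parameter, and the non-causal, single-head setting are all illustrative assumptions.

```python
import numpy as np


def cosine_reweighted_linear_attention(q, k, v, eps=1e-6):
    """Linear-cost attention with a cosine locality bias (illustrative sketch).

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Assumed kernel: ReLU. Assumed weight: cos(pi/2 * (i - j) / n).
    """
    n = q.shape[0]
    q = np.maximum(q, 0.0)  # non-negative kernel features
    k = np.maximum(k, 0.0)

    # cos(a_i - a_j) = cos(a_i)cos(a_j) + sin(a_i)sin(a_j), so the pairwise
    # weight splits into per-position factors and no (n, n) matrix is needed.
    angle = (np.pi / 2.0) * np.arange(n) / n
    cos_w = np.cos(angle)[:, None]
    sin_w = np.sin(angle)[:, None]
    q_cos, q_sin = q * cos_w, q * sin_w
    k_cos, k_sin = k * cos_w, k * sin_w

    # Summarise keys and values first: (d, d_v) matrices, cost O(n * d * d_v).
    num = q_cos @ (k_cos.T @ v) + q_sin @ (k_sin.T @ v)
    den = q_cos @ k_cos.sum(axis=0) + q_sin @ k_sin.sum(axis=0)
    return num / (den[:, None] + eps)


def quadratic_reference(q, k, v, eps=1e-6):
    """Same computation with the explicit (n, n) weight matrix, for checking."""
    n = q.shape[0]
    q = np.maximum(q, 0.0)
    k = np.maximum(k, 0.0)
    idx = np.arange(n)
    weights = np.cos((np.pi / 2.0) * (idx[:, None] - idx[None, :]) / n)
    scores = (q @ k.T) * weights  # nearby positions get weights closer to 1
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((16, 8))
    k = rng.standard_normal((16, 8))
    v = rng.standard_normal((16, 4))
    fast = cosine_reweighted_linear_attention(q, k, v)
    slow = quadratic_reference(q, k, v)
    print(np.allclose(fast, slow))
```

In this sketch the two functions compute the same quantity; the first avoids the n-by-n score matrix by multiplying keys with values before queries, which is where the linear dependence on sequence length comes from.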




Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam

In a hybrid automatic speech recognition (ASR) system, a pronunciation l...

Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition

Recent studies have shown that using an external Language Model (LM) ben...

Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions

Recent research using pre-trained transformer models suggests that just ...

cosFormer: Rethinking Softmax in Attention

Transformer has shown great successes in natural language processing, co...

CopyNE: Better Contextual ASR by Copying Named Entities

Recent years have seen remarkable progress in automatic speech recogniti...

Accelerating Transducers through Adjacent Token Merging

Recent end-to-end automatic speech recognition (ASR) systems often utili...
