Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT

by Ye Bai, et al.

Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, an AED model cannot predict all tokens in parallel, which makes inference relatively slow. Because the encoder already captures the whole speech utterance, which implicitly contains token-level relationships, we believe a token can be predicted without explicit autoregressive language modeling. When the prediction of a token does not rely on other tokens, all tokens in the sequence can be predicted in parallel. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS), all built from basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to tokens to convert the acoustic representations into token-level representations. The decoder further captures token-level relationships with the self-attention mechanism. Finally, a probability distribution over the vocabulary is computed for each token position, so speech recognition is re-formulated as a position-wise classification problem. Furthermore, we propose a cross-modal transfer learning method that refines semantics from the large-scale pre-trained language model BERT to improve performance.
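The abstract's encoder / PDS / decoder pipeline can be sketched in PyTorch. This is a minimal illustration under assumed hyperparameters, not the authors' implementation: the PDS is modeled as cross-attention from learned positional queries to the encoder output, and layer counts, dimensions, and names (`LASO`, `pos_queries`, `max_tokens`) are all hypothetical.

```python
import torch
import torch.nn as nn

class LASO(nn.Module):
    """Sketch of a LASO-style non-autoregressive ASR model.

    Illustrative only: module choices and hyperparameters are assumptions,
    not the configuration from the paper.
    """

    def __init__(self, feat_dim=80, d_model=256, nhead=4,
                 vocab_size=1000, max_tokens=50):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        # Encoder: extracts high-level acoustic representations from frames.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # PDS: learned positional queries cross-attend to the encoder output,
        # converting frame-level features into token-level representations.
        self.pos_queries = nn.Parameter(torch.randn(max_tokens, d_model))
        self.pds_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Decoder: self-attention over token positions captures token-level
        # relationships; no autoregressive masking, so all positions run in parallel.
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        # Position-wise classifier over the vocabulary.
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, feats):  # feats: (batch, frames, feat_dim)
        h = self.encoder(self.input_proj(feats))
        q = self.pos_queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tok, _ = self.pds_attn(q, h, h)   # (batch, max_tokens, d_model)
        tok = self.decoder(tok)
        return self.classifier(tok)       # logits for every token position at once

model = LASO()
logits = model(torch.randn(2, 120, 80))   # 2 utterances, 120 frames each
print(tuple(logits.shape))                # (2, 50, 1000)
```

Because every position is classified independently given the shared token-level representations, one forward pass yields the whole hypothesis, which is the source of the speedup over autoregressive decoding.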




