Insertion-Based Modeling for End-to-End Automatic Speech Recognition

05/27/2020
by Yuya Fujita, et al.

End-to-end (E2E) models have gained attention in the field of automatic speech recognition (ASR). With the exception of connectionist temporal classification (CTC) and its variants, most E2E models proposed so far assume left-to-right autoregressive generation of the output token sequence. However, left-to-right decoding cannot take future output context into account and is not always optimal for ASR. One family of non-left-to-right models, the non-autoregressive Transformer (NAT), has been intensively investigated in neural machine translation (NMT) research. One NAT model, mask-predict, has been applied to ASR, but it needs heuristics or an additional component to estimate the length of the output token sequence. This paper proposes to apply another type of NAT, insertion-based models originally proposed for NMT, to ASR tasks. Insertion-based models avoid this length-estimation issue and can generate an output sequence in an arbitrary order. In addition, we introduce a new formulation of joint training of insertion-based models and CTC. This formulation reinforces CTC by conditioning it on insertion-based token generation in a non-autoregressive manner. We conducted experiments on three public benchmarks and achieved performance competitive with a strong autoregressive Transformer under similar decoding conditions.
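To make the decoding idea concrete, below is a minimal sketch of greedy parallel insertion-based decoding in the style of the Insertion Transformer (Stern et al., 2019), which this line of work builds on. The model interface `score_insertions`, the `<eos>` slot token, and the round limit are illustrative assumptions for this sketch, not the authors' actual implementation.

```python
# A minimal sketch of parallel insertion-based decoding.
# Assumption: model.score_insertions(hyp) is a hypothetical method that,
# for each of the len(hyp)+1 slots between tokens of the current
# hypothesis, returns a dict mapping candidate tokens to scores,
# including a special EOS_SLOT token meaning "insert nothing here".

EOS_SLOT = "<eos>"   # "no more insertions in this slot" (illustrative)
MAX_ROUNDS = 32      # safety bound on the number of insertion rounds

def insertion_decode(model, max_rounds=MAX_ROUNDS):
    hyp = []  # start from an empty hypothesis
    for _ in range(max_rounds):
        # One forward pass scores all slots in parallel; this is what
        # makes the decoder non-autoregressive.
        slot_scores = model.score_insertions(hyp)
        insertions = []
        for slot, scores in enumerate(slot_scores):
            token = max(scores, key=scores.get)  # greedy pick per slot
            if token != EOS_SLOT:
                insertions.append((slot, token))
        if not insertions:
            break  # every slot predicted <eos>: the sequence is complete
        # Apply insertions right-to-left so earlier slot indices
        # remain valid as the hypothesis grows.
        for slot, token in sorted(insertions, reverse=True):
            hyp.insert(slot, token)
    return hyp
```

With a balanced-binary-tree insertion order, each round can roughly double the hypothesis length, so the number of decoder passes grows logarithmically in the output length rather than linearly as in left-to-right decoding; this is the efficiency argument for insertion-based models that the abstract alludes to.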


