Streaming Audio-Visual Speech Recognition with Alignment Regularization

11/03/2022
by   Pingchuan Ma, et al.

Recognizing a word shortly after it is spoken is an important requirement for automatic speech recognition (ASR) systems in real-world scenarios. As a result, a large body of work on streaming audio-only ASR models has been presented in the literature. However, streaming audio-visual automatic speech recognition (AV-ASR) has received little attention in earlier works. In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and visual encoder neural networks are both based on the conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution. Streaming recognition with a decoder neural network is realized using the triggered attention technique, which performs time-synchronous decoding with joint CTC/attention scoring. For frame-level ASR criteria, such as CTC, a synchronized response from the audio and visual encoders is critical for joint AV decision making. We therefore propose a novel alignment regularization technique that promotes synchronization of the audio and visual encoders, which in turn results in better word error rates (WERs) at all SNR levels for both streaming and offline AV-ASR models. The proposed AV-ASR model achieves WERs of 2.0 [...] on the Lip Reading Sentences 3 (LRS3) dataset in an offline and online setup, respectively, both of which are state-of-the-art results when no external training data are used.
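
To illustrate the chunk-wise self-attention (CSA) idea mentioned in the abstract, the following is a minimal sketch of an attention mask in which each frame may attend only to frames in its own chunk and in earlier chunks. It is an illustration under stated assumptions, not the authors' implementation: the function name `chunk_attention_mask`, the `left_chunks` parameter, and the chunk size used below are all hypothetical and do not come from the paper.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks=None):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Assumption (not from the paper): frames are grouped into fixed-size,
    non-overlapping chunks; a frame sees its whole current chunk plus
    earlier chunks, optionally limited to the last `left_chunks` chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        row = []
        for j in range(num_frames):
            key_chunk = j // chunk_size
            # No look-ahead beyond the current chunk.
            visible = key_chunk <= query_chunk
            # Optionally bound the history to keep memory constant.
            if left_chunks is not None:
                visible = visible and key_chunk >= query_chunk - left_chunks
            row.append(visible)
        mask.append(row)
    return mask


# Hypothetical example: 6 frames, chunks of 2 frames, 1 chunk of history.
mask = chunk_attention_mask(6, 2, left_chunks=1)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Because no frame attends past the end of its own chunk, the look-ahead of such an encoder layer is bounded by the chunk size, which is what makes chunk-based processing compatible with streaming.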

research
01/08/2020

Streaming automatic speech recognition with the transformer model

Encoder-decoder based sequence-to-sequence models have demonstrated stat...
research
05/21/2020

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

Recently, streaming end-to-end automatic speech recognition (E2E-ASR) ha...
research
11/29/2022

Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation

The neural transducer is an end-to-end model for automatic speech recogn...
research
01/04/2023

Audio-Visual Efficient Conformer for Robust Speech Recognition

End-to-end Automatic Speech Recognition (ASR) systems based on neural ne...
research
06/20/2023

Timestamped Embedding-Matching Acoustic-to-Word CTC ASR

In this work, we describe a novel method of training an embedding-matchi...
research
07/22/2022

ASR Error Detection via Audio-Transcript entailment

Despite improved performances of the latest Automatic Speech Recognition...
research
12/22/2022

Alignment Entropy Regularization

Existing training criteria in automatic speech recognition (ASR) permit t...
