End-to-end Audio-visual Speech Recognition with Conformers

02/12/2021
by Pingchuan Ma, et al.

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer) that can be trained in an end-to-end manner. In particular, the visual and audio encoders learn to extract features directly from raw pixels and audio waveforms, respectively. These features are fed to conformers, and the two streams are then fused via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that three design choices significantly improve the performance of our model: end-to-end training, rather than the pre-computed visual features that are common in the literature; the use of a conformer, rather than a recurrent network; and the use of a transformer-based language model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
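The two ideas named in the abstract, MLP-based fusion and a combined CTC/attention objective, can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names (`mlp_fuse`, `hybrid_loss`), the choice to concatenate the two streams before the MLP, the single hidden layer with ReLU, and the default weight of 0.1 are all assumptions made for illustration. The weighted sum itself is the usual way hybrid CTC/Attention objectives are written.

```python
def hybrid_loss(ctc_loss, att_loss, ctc_weight=0.1):
    """Interpolate the two objectives: w * L_ctc + (1 - w) * L_att.

    ctc_weight=0.1 is an illustrative default, not a value taken from the abstract.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def mlp_fuse(audio_feat, visual_feat, w1, b1, w2, b2):
    """Fuse one frame of audio and visual features with a two-layer MLP.

    Assumption: the streams are joined by concatenation before the MLP.
    """
    joint = list(audio_feat) + list(visual_feat)
    return linear(relu(linear(joint, w1, b1)), w2, b2)


if __name__ == "__main__":
    # Toy weights chosen by hand so the arithmetic is easy to follow.
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    fused = mlp_fuse([1.0, -1.0], [2.0],
                     identity, [0.0, 0.0, 0.0],
                     [[1.0, 1.0, 1.0]], [0.5])
    print("fused frame:", fused)
    print("hybrid loss:", hybrid_loss(2.0, 1.0, ctc_weight=0.5))
```

In a real system the per-frame features would come from the conformer encoders and the two loss terms from a CTC head and an attention decoder. Here they are plain numbers so the combination step can be read in isolation.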


