Discriminative Multi-modality Speech Recognition

05/12/2020
by Bo Xu, et al.
Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments where the performance of the audio modality alone deteriorates significantly. Combining the visual modality with audio upgrades ASR to multi-modality speech recognition (MSR). In this paper, we propose a two-stage speech recognition model. In the first stage, the target voice is separated from background noise with the help of the corresponding visual information from lip movements, which helps the model understand the speech clearly. In the second stage, the audio modality is combined with the visual modality again by an MSR sub-network to better understand the speech, further improving the recognition rate. Our other key contributions are as follows: we introduce a pseudo-3D residual convolution (P3D)-based visual front-end to extract more discriminative features; we upgrade the temporal convolution block from a 1D ResNet to a temporal convolutional network (TCN), which is better suited to temporal tasks; and we build the MSR sub-network on top of the Element-wise-Attention Gated Recurrent Unit (EleAtt-GRU), which is more effective than the Transformer on long sequences. We conducted extensive experiments on the LRS3-TED and LRW datasets. Our two-stage model (audio-enhanced multi-modality speech recognition, AE-MSR) consistently achieves state-of-the-art performance by a significant margin, which demonstrates the necessity and effectiveness of AE-MSR.
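
As a rough illustration of why a TCN suits temporal tasks, the sketch below implements the dilated causal 1D convolution that TCNs are commonly built from, in plain Python. The abstract does not give the kernel size, dilation schedule, or whether the paper's TCN is causal, so all of these are assumptions, and the function names (`dilated_causal_conv1d`, `receptive_field`, `tcn_stack`) are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence


def dilated_causal_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Causal 1D convolution with dilation; output has the same length as x.

    y[t] = sum_j kernel[j] * x[t - j * dilation], with x taken as zero for t < 0,
    so each output depends only on the current and earlier time steps.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:  # implicit left zero-padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of a stack with one dilated convolution per level."""
    return 1 + (kernel_size - 1) * sum(dilations)


def tcn_stack(
    x: Sequence[float], kernel: Sequence[float], dilations: Sequence[int]
) -> List[float]:
    """Apply one dilated causal conv per level, each with ReLU and a residual add.

    Assumption: the same kernel is shared across levels to keep the sketch short;
    a trained TCN would learn separate weights per level and per channel.
    """
    h = [float(v) for v in x]
    for d in dilations:
        conv = dilated_causal_conv1d(h, kernel, d)
        h = [max(0.0, c) + r for c, r in zip(conv, h)]
    return h


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 9
    print(receptive_field(2, [1, 2, 4]))
    print(tcn_stack(impulse, [1.0, 1.0], [1, 2, 4]))
```

Doubling the dilation at each level makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is the usual argument for preferring a TCN over a plain 1D convolution stack on long sequences.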

research
01/25/2022

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) extends the speech re...
research
11/13/2018

Modality Attention for End-to-End Audio-visual Speech Recognition

Audio-visual speech recognition (AVSR) system is thought to be one of th...
research
06/15/2022

AVATAR: Unconstrained Audiovisual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) is an extension of AS...
research
12/14/2020

AV Taris: Online Audio-Visual Speech Recognition

In recent years, Automatic Speech Recognition (ASR) technology has appro...
research
06/10/2023

OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment

Speech Recognition builds a bridge between the multimedia streaming (aud...
research
05/19/2020

Should we hard-code the recurrence concept or learn it instead ? Exploring the Transformer architecture for Audio-Visual Speech Recognition

The audio-visual speech fusion strategy AV Align has shown significant p...
research
10/19/2017

Combining Multiple Views for Visual Speech Recognition

Visual speech recognition is a challenging research problem with a parti...