Robust Self-Supervised Audio-Visual Speech Recognition

01/05/2022
by   Bowen Shi, et al.
0

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by  50 the presence of babble noise, while reducing the WER of an audio-based model by over 75

READ FULL TEXT

page 1

page 2

page 3

page 4

research
01/05/2022

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Video recordings of speech contain correlated audio and visual informati...
research
02/28/2023

Practice of the conformer enhanced AUDIO-VISUAL HUBERT on Mandarin and English

Considering the bimodal nature of human speech perception, lips, and tee...
research
06/15/2022

AVATAR: Unconstrained Audiovisual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) is an extension of AS...
research
06/14/2021

Learning Audio-Visual Dereverberation

Reverberation from audio reflecting off surfaces and objects in the envi...
research
11/12/2020

Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot

In the future robots will interact more and more with humans and will ha...
research
08/11/2023

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

Visual Speech Recognition (VSR) differs from the common perception tasks...
research
09/30/2021

SpliceOut: A Simple and Efficient Audio Augmentation Method

Time masking has become a de facto augmentation technique for speech and...

Please sign up or login with your details

Forgot password? Click here to reset