Listen, Look and Deliberate: Visual context-aware speech recognition using pre-trained text-video representations

by Shahram Ghorbani, et al.

In this study, we address the problem of leveraging visual signals to improve Automatic Speech Recognition (ASR), also known as visual context-aware ASR (VC-ASR). We explore novel VC-ASR approaches that leverage video and text representations extracted by a self-supervised pre-trained text-video embedding model. First, we propose a multi-stream attention architecture to leverage signals from both the audio and video modalities. This architecture consists of separate encoders for the two modalities and a single decoder that attends over them. We show that this architecture is better than fusing the modalities at the signal level. We also explore leveraging the visual information in a second-pass model, which has also been referred to as a 'deliberation model'. The deliberation model accepts audio representations and text hypotheses from the first-pass ASR and combines them with a visual stream for improved visual context-aware recognition. The proposed deliberation scheme can work on top of any well-trained ASR and also enables us to leverage the pre-trained text model to ground the hypotheses with the visual features. Our experiments on the HOW2 dataset show that the multi-stream and deliberation architectures are very effective at the VC-ASR task. We evaluate the proposed models in two scenarios: a clean audio stream, and distorted audio in which we mask out some specific words. The deliberation model outperforms the multi-stream model and achieves a relative WER improvement of 6% for the clean and masked data, respectively, compared to an audio-only model. The deliberation model also improves the recovery of masked words by 59% relative.
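To make the multi-stream idea concrete, the following is a minimal sketch of one decoder step attending separately over an audio encoder's states and a video encoder's states. It is not the authors' implementation: the scaled dot-product scoring, the concatenation of the two context vectors, and all names (`attend`, `multi_stream_context`) are assumptions chosen for illustration, since the abstract does not specify how the streams are scored or combined.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, states):
    """Scaled dot-product attention of one query over one encoder stream.

    Assumption: dot-product scoring; the paper may use a different scorer.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, state)) / scale for state in states]
    weights = softmax(scores)
    # Context is the attention-weighted sum of the encoder states.
    context = [
        sum(w * state[i] for w, state in zip(weights, states))
        for i in range(len(query))
    ]
    return context, weights

def multi_stream_context(query, audio_states, video_states):
    """Attend over each modality separately, then combine the contexts.

    Assumption: contexts are combined by concatenation (hypothetical choice).
    """
    audio_ctx, audio_w = attend(query, audio_states)
    video_ctx, video_w = attend(query, video_states)
    return audio_ctx + video_ctx, audio_w, video_w

# Toy inputs: 50 audio frames and 5 video segments, each an 8-dim vector.
random.seed(0)
dim = 8
audio_states = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
video_states = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
query = [random.gauss(0, 1) for _ in range(dim)]

context, audio_w, video_w = multi_stream_context(query, audio_states, video_states)
```

Because each stream keeps its own attention distribution, the audio and video sequences can have different lengths and frame rates, which is one plausible reason to prefer this over fusing the modalities at the signal level.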




