Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

05/15/2022
by Bowen Shi, et al.

This paper investigates self-supervised pre-training for audio-visual speaker representation learning, where a visual stream showing the speaker's mouth area is used alongside speech as input. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and of the visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency by roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38 and 75
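For readers less familiar with the metric, equal error rate (EER) in speaker verification is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from trial scores. It is not the authors' evaluation code; the function name `equal_error_rate` and the example scores are hypothetical and chosen only for illustration.

```python
def equal_error_rate(scores, labels):
    """Estimate EER from verification trial scores.

    scores: similarity scores, higher means "more likely same speaker".
    labels: truthy for target (same-speaker) trials, falsy otherwise.
    A trial is accepted when its score is >= the threshold.
    """
    target = [s for s, y in zip(scores, labels) if y]
    nontarget = [s for s, y in zip(scores, labels) if not y]
    if not target or not nontarget:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = None, None
    # Sweep every observed score as a candidate threshold.
    for t in sorted(set(scores)):
        far = sum(s >= t for s in nontarget) / len(nontarget)  # false accepts
        frr = sum(s < t for s in target) / len(target)         # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical trial list (illustrative values, not data from the paper).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(equal_error_rate(scores, labels))
```

With a finite trial list the two error rates rarely cross exactly, so this sketch reports their average at the closest threshold; evaluation toolkits often interpolate along the ROC curve instead.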

