AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

06/25/2023
by Jiuxin Lin, et al.

Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based, attention-driven dual-scale model that uses cross- and self-attention to fuse and model audio and visual features. AV-SepFormer splits the audio feature into a number of chunks equal to the length of the visual feature; self- and cross-attention are then employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information both between and within chunks, yielding significant gains over traditional positional encoding. Our model has two key advantages: the time granularity of the chunked audio features is synchronized with that of the visual features, which alleviates the harm caused by the mismatch between audio and video sampling rates; and by combining self- and cross-attention, feature fusion and speech extraction are unified within a single attention paradigm. Experimental results show that AV-SepFormer significantly outperforms existing methods.
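To make the chunking-plus-attention idea concrete, below is a minimal PyTorch sketch of one fusion step. It is not the authors' implementation: the feature dimensions, the chunking assumption (audio length divisible by visual length), the sinusoidal form of the 2D positional encoding, and the exact attention wiring (intra-chunk self-attention followed by audio-to-visual cross-attention over chunk tokens) are all illustrative assumptions inferred from the abstract.

```python
import math
import torch
import torch.nn as nn


def sinusoidal(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal encoding for a 1-D tensor of integer positions."""
    inv_freq = torch.exp(
        torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim)
    )
    angles = positions.float().unsqueeze(-1) * inv_freq      # (len, dim/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (len, dim)


class ChunkedCrossAttentionFusion(nn.Module):
    """Illustrative fusion step: chunk the audio sequence so each chunk
    aligns with one visual frame, add a 2D (inter-chunk + intra-chunk)
    positional encoding, then apply self- and cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, T_a, D); visual: (B, T_v, D); assumes T_a % T_v == 0.
        B, T_a, D = audio.shape
        T_v = visual.shape[1]
        C = T_a // T_v                       # audio frames per video frame
        chunks = audio.view(B, T_v, C, D)    # (B, n_chunks, chunk_len, D)

        # 2D positional encoding: one term for the chunk index (between
        # chunks) plus one for the offset inside each chunk (within chunks).
        pos_inter = sinusoidal(torch.arange(T_v), D).view(1, T_v, 1, D)
        pos_intra = sinusoidal(torch.arange(C), D).view(1, 1, C, D)
        chunks = chunks + pos_inter + pos_intra

        # Intra-chunk self-attention over the fine-grained audio frames.
        x = chunks.reshape(B * T_v, C, D)
        x, _ = self.self_attn(x, x, x)

        # Inter-chunk cross-attention: for each intra-chunk offset, the
        # sequence of chunk tokens has length T_v, i.e. the same time
        # granularity as the visual stream, so audio queries can attend
        # directly to visual keys/values.
        x = x.view(B, T_v, C, D).transpose(1, 2).reshape(B * C, T_v, D)
        v = visual.unsqueeze(1).expand(B, C, T_v, D).reshape(B * C, T_v, D)
        x, _ = self.cross_attn(x, v, v)
        return x.view(B, C, T_v, D).transpose(1, 2).reshape(B, T_a, D)


# Toy usage: 800 audio frames, 50 video frames -> chunk length 16.
fusion = ChunkedCrossAttentionFusion(dim=256, heads=8)
out = fusion(torch.randn(2, 800, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 800, 256])
```

Note how chunking makes the inter-chunk sequence length equal to the visual sequence length, which is what lets audio and visual tokens be paired one-to-one inside the cross-attention without any resampling.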


Related research

08/16/2023 · SCANet: A Self- and Cross-Attention Network for Audio-Visual Speech Separation
The integration of different modalities, such as audio and visual inform...

07/09/2022 · Dual-path Attention is All You Need for Audio-Visual Speech Extraction
Audio-visual target speech extraction, which aims to extract a certain s...

02/02/2021 · Multimodal Attention Fusion for Target Speaker Extraction
Target speaker extraction, which aims at extracting a target speaker's v...

09/12/2023 · DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention
With the rise in manipulated media, deepfake detection has become an imp...

03/24/2022 · Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3
We propose a cross-modal co-attention model for continuous emotion recog...

02/04/2023 · LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers
Lipreading refers to understanding and further translating the speech of...

10/04/2022 · Pay Self-Attention to Audio-Visual Navigation
Audio-visual embodied navigation, as a hot research topic, aims training...
