Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

09/15/2023
by   Junjie Li, et al.
0

Target speaker extraction aims to extract the speech of a specific speaker from a multi-talker mixture as specified by an auxiliary reference. Most studies focus on the scenario where the target speech is highly overlapped with the interfering speech. However, this scenario only accounts for a small percentage of real-world conversations. In this paper, we aim at the sparsely overlapped scenarios in which the auxiliary reference needs to perform two tasks simultaneously: detect the activity of the target speaker and disentangle the active speech from any interfering speech. We propose an audio-visual speaker extraction model named ActiveExtract, which leverages speaking activity from audio-visual active speaker detection (ASD). The ASD directly provides the frame-level activity of the target speaker, while its intermediate feature representation is trained to discriminate speech-lip synchronization that could be used for speaker disentanglement. Experimental results show our model outperforms baselines across various overlapping ratios, achieving an average improvement of more than 4 dB in terms of SI-SNR.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/30/2021

USEV: Universal Speaker Extraction with Visual Cue

A speaker extraction algorithm seeks to extract the target speaker's voi...
research
05/22/2023

Target Active Speaker Detection with Audio-visual Cues

In active speaker detection (ASD), we would like to detect whether an on...
research
06/17/2022

Simultaneous Speech Extraction for Multiple Target Speakers under the Meeting Scenarios(V1)

Recently, the target speech separation or extraction techniques under th...
research
09/19/2023

USED: Universal Speaker Extraction and Diarization

Speaker extraction and diarization are two crucial enabling techniques f...
research
10/31/2022

ImagineNET: Target Speaker Extraction with Intermittent Visual Cue through Embedding Inpainting

The speaker extraction technique seeks to single out the voice of a targ...
research
10/09/2022

VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

Speaker extraction seeks to extract the target speech in a multi-talker ...
research
02/01/2022

New Insights on Target Speaker Extraction

In recent years, researchers have become increasingly interested in spea...

Please sign up or login with your details

Forgot password? Click here to reset