Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

01/06/2022
by Hao Jiang, et al.

Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may appear at difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and poor lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. We instead approach the problem in a new setting that uses both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that gives robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera's field of view, while simultaneously detecting the device wearer's own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.

research
03/09/2020

Crossmodal learning for audio-visual speech event localization

An objective understanding of media depictions, such as about inclusive ...
research
08/21/2020

RespVAD: Voice Activity Detection via Video-Extracted Respiration Patterns

Voice Activity Detection (VAD) refers to the task of identification of r...
research
10/14/2022

Intel Labs at Ego4D Challenge 2022: A Better Baseline for Audio-Visual Diarization

This report describes our approach for the Audio-Visual Diarization (AVD...
research
04/17/2019

Understanding the Effectiveness of Ultrasonic Microphone Jammer

Recent works have explained the principle of using ultrasonic transmissi...
research
03/28/2023

Egocentric Auditory Attention Localization in Conversations

In a noisy conversation environment such as a dinner party, people often...
research
06/09/2022

Audio-video fusion strategies for active speaker detection in meetings

Meetings are a common activity in professional contexts, and it remains ...
research
09/15/2023

A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

We introduce a distinctive real-time, causal, neural network-based activ...
