A Real-Time Active Speaker Detection System Integrating an Audio-Visual Signal with a Spatial Querying Mechanism

09/15/2023
by   Ilya Gurvich, et al.

We introduce a real-time, causal, neural-network-based active speaker detection system optimized for low-power edge computing. The system drives a virtual cinematography module and is deployed on a commercial device. It uses data from a microphone array and a 360-degree camera. In a meeting with 14 participants, our network requires only 127 MFLOPs per participant. Unlike previous work, we examine the network's error rate when the computational budget is exhausted and find that it degrades gracefully, allowing the system to operate reasonably well even in this case. Departing from conventional direction-of-arrival (DOA) estimation approaches, our network learns to query the available acoustic data based on the detected head locations. We train and evaluate our algorithm on a realistic meetings dataset featuring up to 14 participants in the same meeting, overlapped speech, and other challenging scenarios.
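
To make the idea of querying acoustic data by head location more concrete, here is a minimal sketch, not the network described above. It assumes the acoustic data has already been reduced to an energy value per azimuth bin, and it uses a fixed, hand-written proximity weighting where the actual system would use learned weights. The names `query_acoustic_map`, `sharpness`, and the toy values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def query_acoustic_map(head_azimuth_deg, bin_azimuths_deg, bin_energies,
                       sharpness=0.1):
    """Soft-attention readout of acoustic energy around one head location.

    Assumption: attention weights depend only on angular proximity.
    A trained network would learn this weighting from data instead.
    """
    scores = [-sharpness * angular_distance(head_azimuth_deg, b)
              for b in bin_azimuths_deg]
    weights = softmax(scores)
    return sum(w * e for w, e in zip(weights, bin_energies))


# Toy input: 8 azimuth bins covering 360 degrees, energy peaking near 90.
bins = [i * 45.0 for i in range(8)]
energies = [0.1, 0.3, 0.9, 0.3, 0.1, 0.05, 0.05, 0.05]

# Hypothetical detected head locations (azimuth in degrees).
heads = {"A": 90.0, "B": 270.0}

scores = {name: query_acoustic_map(az, bins, energies)
          for name, az in heads.items()}
likely_speaker = max(scores, key=scores.get)
```

In this toy setup each detected head acts as a query into the acoustic map, so the cost grows with the number of heads rather than with a dense sweep over all directions. That is one plausible reading of why a per-participant compute figure is reported.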
