Learning Spatial-Temporal Graphs for Active Speaker Detection

12/02/2021
by   Sourya Roy, et al.
We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes representing the same identity are connected by edges within a defined temporal window. Nodes within the same video frame are also connected to encode inter-person interactions. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations, owing to their explicit spatial and temporal structure, significantly improves overall performance. SPELL outperforms several relevant baselines and performs on par with state-of-the-art models while requiring an order of magnitude less computation.
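
The graph construction described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each node is one detected person in one frame, that the temporal window is measured in frames, and that an identity label is already available for every detection. The names `FaceNode`, `build_edges`, and `window` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class FaceNode:
    """Hypothetical node: one person detection in one video frame."""
    frame: int      # frame index of the detection
    identity: int   # person (track) identifier, assumed to be given


def build_edges(nodes, window):
    """Return undirected edges as (i, j, kind) tuples with i < j.

    - "spatial": both nodes lie in the same frame (inter-person interaction).
    - "temporal": both nodes share an identity and their frames are at most
      `window` frames apart (assumed unit: frames).
    """
    edges = []
    for i, j in combinations(range(len(nodes)), 2):
        a, b = nodes[i], nodes[j]
        if a.frame == b.frame:
            edges.append((i, j, "spatial"))
        elif a.identity == b.identity and abs(a.frame - b.frame) <= window:
            edges.append((i, j, "temporal"))
    return edges


# Toy example: two people in frame 0, then person 0 again in frames 1 and 3.
nodes = [FaceNode(0, 0), FaceNode(0, 1), FaceNode(1, 0), FaceNode(3, 0)]
edges = build_edges(nodes, window=1)
print(edges)
```

With `window=1`, the sketch links the two people in frame 0 with a spatial edge and links person 0 across frames 0 and 1 with a temporal edge; the detection in frame 3 stays isolated because it falls outside the window. In a node classification setup such as the one the abstract describes, each node would additionally carry audio-visual features and a speaking or not-speaking label, which this sketch leaves out.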


