Self-Supervised Vision-Based Detection of the Active Speaker as a Prerequisite for Socially-Aware Language Acquisition

11/24/2017
by Kalin Stefanov, et al.

This paper presents a self-supervised method for detecting the active speaker in a multi-person spoken interaction scenario. We argue that this capability is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. Our methods are able to detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Our methods do not rely on external annotations and are thus consistent with cognitive development. Instead, they use information from the auditory modality to support learning in the visual domain. The methods have been extensively evaluated on a large multi-person face-to-face interaction dataset. The results reach an accuracy of 80% in a multi-speaker setting. We believe this system represents an essential component of any artificial cognitive system or robotic platform engaging in social interaction.
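
The idea of letting the auditory modality supervise learning in the visual domain can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: a plain energy threshold stands in for a voice activity detector, a single hypothetical "mouth opening" value per frame stands in for real face features, and a logistic regression stands in for the visual model. The function names `audio_pseudo_labels`, `train_visual_classifier`, and `predict_speaking` are invented for this example.

```python
import numpy as np


def audio_pseudo_labels(audio, sample_rate, fps, threshold_db=-35.0):
    """Turn one person's audio channel into per-video-frame speaking labels.

    Assumption: an energy threshold stands in for a real voice activity
    detector; the threshold value is illustrative only.
    """
    samples_per_frame = int(sample_rate // fps)
    n_frames = len(audio) // samples_per_frame
    chunks = np.asarray(audio[: n_frames * samples_per_frame], dtype=float)
    chunks = chunks.reshape(n_frames, samples_per_frame)
    rms = np.sqrt(np.mean(chunks ** 2, axis=1))
    level_db = 20.0 * np.log10(rms + 1e-10)  # small offset avoids log(0)
    return (level_db > threshold_db).astype(np.int64)


def train_visual_classifier(features, labels, lr=0.5, epochs=1000):
    """Fit a logistic regression on visual features using the audio-derived labels."""
    n_samples, n_dims = features.shape
    w = np.zeros(n_dims)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = p - labels  # gradient of the cross-entropy loss w.r.t. the logit
        w -= lr * (features.T @ grad) / n_samples
        b -= lr * grad.mean()
    return w, b


def predict_speaking(features, w, b):
    """Predict speaking (1) or silent (0) from visual features alone."""
    return (features @ w + b > 0).astype(np.int64)


if __name__ == "__main__":
    sample_rate, fps = 16000, 25
    t = np.arange(sample_rate) / sample_rate       # one second of audio
    audio = 0.5 * np.sin(2 * np.pi * 220 * t)      # synthetic "speech"
    audio[:6400] = 0.0                             # first 10 video frames silent
    labels = audio_pseudo_labels(audio, sample_rate, fps)
    # Synthetic visual feature: larger mouth opening on speaking frames.
    mouth_opening = np.where(labels == 1, 0.9, 0.1).reshape(-1, 1)
    w, b = train_visual_classifier(mouth_opening, labels)
    print(predict_speaking(mouth_opening, w, b))
```

In this sketch audio is needed only during training. Prediction uses visual features alone, and running one such classifier per visible face allows any number of overlapping speakers to be flagged independently.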

research
08/17/2021

Look Who's Talking: Active Speaker Detection in the Wild

In this work, we present a novel audio-visual dataset for active speaker...
research
08/10/2020

Self-Supervised Learning of Audio-Visual Objects from Video

Our objective is to transform a video into a set of discrete audio-visua...
research
03/30/2022

Using Active Speaker Faces for Diarization in TV shows

Speaker diarization is one of the critical components of computational m...
research
01/05/2019

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Active speaker detection is an important component in video analysis alg...
research
03/29/2016

Cross-modal Supervision for Learning Active Speaker Detection in Video

In this paper, we show how to use audio to supervise the learning of act...
research
07/29/2022

Face-to-Face Contrastive Learning for Social Intelligence Question-Answering

Creating artificial social intelligence - algorithms that can understand...
research
05/28/2022

Is Lip Region-of-Interest Sufficient for Lipreading?

Lip region-of-interest (ROI) is conventionally used for visual input in ...
