Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

05/06/2023
by Bolin Lai, et al.

Egocentric gaze anticipation is a key building block for emerging Augmented Reality capabilities. Notably, gaze behavior during daily activities is driven by both visual cues and audio signals. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that uses two modules to separately capture audio-visual correlations along the spatial and temporal dimensions, and applies a contrastive loss to the re-weighted audio-visual features from the fusion modules for representation learning. We conduct extensive ablation studies and a thorough analysis on two egocentric video datasets, Ego4D and Aria, to validate our model design, and demonstrate improvements over prior state-of-the-art methods. Moreover, we provide visualizations of the gaze anticipation results and offer additional insights into audio-visual representation learning.
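To make the described design more concrete, below is a minimal, illustrative sketch of spatial-temporal separable audio-visual fusion with a contrastive objective, loosely following the abstract. It is not the authors' implementation: the module names, tensor shapes, attention layout, and hyperparameters are assumptions chosen for clarity.

```python
# Illustrative sketch only; shapes and design choices are assumptions,
# not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialFusion(nn.Module):
    """Re-weight each frame's visual spatial tokens by their correlation with audio."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, aud):
        # vis: (B, T, N, D) spatial tokens per frame; aud: (B, T, D) one audio token per frame
        B, T, N, D = vis.shape
        q = aud.reshape(B * T, 1, D)             # one audio query per frame
        kv = vis.reshape(B * T, N, D)            # spatial visual tokens as keys/values
        fused, _ = self.attn(q, kv, kv)          # audio-weighted visual feature, (B*T, 1, D)
        return fused.reshape(B, T, D)


class TemporalFusion(nn.Module):
    """Re-weight the audio sequence by its correlation with per-frame visual tokens."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, aud):
        # vis: (B, T, N, D); aud: (B, T, D)
        frame_tokens = vis.mean(dim=2)           # (B, T, D) pooled visual token per frame
        fused, _ = self.attn(frame_tokens, aud, aud)  # visually-weighted audio, (B, T, D)
        return fused


def contrastive_loss(vis_emb, aud_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss pairing each clip's visual and audio embeddings."""
    v = F.normalize(vis_emb, dim=-1)
    a = F.normalize(aud_emb, dim=-1)
    logits = v @ a.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, T, N, D = 2, 8, 196, 256
    vis = torch.randn(B, T, N, D)                # e.g. tokens from a video encoder
    aud = torch.randn(B, T, D)                   # e.g. tokens from a spectrogram encoder

    reweighted_vis = SpatialFusion(D)(vis, aud)  # (B, T, D)
    reweighted_aud = TemporalFusion(D)(vis, aud) # (B, T, D)

    # Contrastive objective on the pooled, re-weighted audio-visual features;
    # a downstream decoder would anticipate the future gaze target from them.
    loss = contrastive_loss(reweighted_vis.mean(dim=1), reweighted_aud.mean(dim=1))
    print(loss.item())
```

The idea mirrored here is the one stated in the abstract: audio-visual correlations are captured separately along the spatial and temporal dimensions, and the resulting re-weighted features are additionally aligned with a contrastive objective for representation learning.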

Related research

11/23/2020 · Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning
We present a novel way for self-supervised video representation learning...

10/24/2022 · Contrastive Representation Learning for Gaze Estimation
Self-supervised learning (SSL) has become prevalent for learning represe...

08/08/2022 · In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation
In this paper, we present the first transformer-based model to address t...

11/18/2022 · Contrastive Positive Sample Propagation along the Audio-Visual Event Line
Visual and audio signals often coexist in natural environments, forming ...

05/27/2021 · SSAN: Separable Self-Attention Network for Video Representation Learning
Self-attention has been successfully applied to video representation lea...

07/07/2022 · AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces
In challenging real-life conditions such as extreme head-pose, occlusion...

12/14/2018 · On Attention Modules for Audio-Visual Synchronization
With the development of media and networking technologies, multimedia ap...
