Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips

12/02/2021
by Lijin Yang, et al.

First-person action recognition is a challenging task in video understanding. Because of strong ego-motion and a limited field of view, many background or noisy frames in a first-person video can distract an action recognition model during its learning process. To encode more discriminative features, the model needs the ability to focus on the most relevant parts of the video. Previous works attempted to address this problem by applying temporal attention, but failed to consider the global context of the full video, which is critical for determining which parts are relatively significant. In this work, we propose a simple yet effective Stacked Temporal Attention Module (STAM) that computes temporal attention from global knowledge across clips to emphasize the most discriminative features. We achieve this by stacking multiple self-attention layers. Instead of naive stacking, which our experiments show to be ineffective, we carefully design the input to each self-attention layer so that both the local and global context of the video are considered when generating the temporal attention weights. Experiments demonstrate that our proposed STAM can be built on top of most existing backbones and boosts performance on various datasets.
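The abstract gives no implementation details on this page, but the core idea lends itself to a short sketch. Below is a minimal PyTorch illustration of stacked temporal self-attention over per-clip features, assuming pooled clip features from some backbone; the class name, residual layout, layer count, and scalar scoring head are our assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class StackedTemporalAttention(nn.Module):
    """Illustrative sketch: stacked self-attention over per-clip features.

    Each layer attends across all clips of a video, so the features fed to
    later layers mix each clip's local feature with global context before
    the final per-clip attention weights are produced.
    """

    def __init__(self, dim: int, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))
        # Hypothetical scoring head: one scalar attention logit per clip.
        self.score = nn.Linear(dim, 1)

    def forward(self, clip_feats: torch.Tensor):
        # clip_feats: (batch, num_clips, dim), e.g. pooled backbone features.
        x = clip_feats
        for attn, norm in zip(self.layers, self.norms):
            # Residual connection keeps each clip's local feature alongside
            # the globally attended context added by this layer.
            ctx, _ = attn(x, x, x)
            x = norm(x + ctx)
        # Temporal attention weights over clips, normalized per video.
        weights = torch.softmax(self.score(x).squeeze(-1), dim=1)
        # Emphasize discriminative clips via a weighted sum of clip features.
        video_feat = (weights.unsqueeze(-1) * clip_feats).sum(dim=1)
        return video_feat, weights


# Usage with hypothetical shapes: 2 videos, 8 clips, 512-d features per clip.
feats = torch.randn(2, 8, 512)
stam = StackedTemporalAttention(dim=512)
video_feat, clip_weights = stam(feats)
print(video_feat.shape, clip_weights.shape)  # (2, 512) and (2, 8)
```

The residual-plus-normalization input to each layer is one plausible way to realize the paper's stated goal of feeding both local and global context into every self-attention stage, as opposed to naive stacking.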


Related research

04/01/2023 · DOAD: Decoupled One Stage Action Detection Network
Localizing people and recognizing their actions from videos is a challen...

12/17/2022 · Inductive Attention for Video Action Anticipation
Anticipating future actions based on video observations is an important ...

08/04/2019 · Action Recognition in Untrimmed Videos with Composite Self-Attention Two-Stream Framework
With the rapid development of deep learning algorithms, action recogniti...

01/12/2020 · Few-shot Action Recognition via Improved Attention with Self-supervision
Most existing few-shot learning methods in computer vision focus on clas...

12/15/2020 · GTA: Global Temporal Attention for Video Action Understanding
Self-attention learns pairwise interactions via dot products to model lo...

02/16/2021 · Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries
We present EgoACO, a deep neural architecture for video action recogniti...

07/03/2020 · Egocentric Action Recognition by Video Attention and Temporal Context
We present the submission of Samsung AI Centre Cambridge to the CVPR2020...
