Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries

02/16/2021
by   Swathikiran Sudhakaran, et al.

We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling in EgoACO, we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a re-designed output gate. Action, object, and context descriptors are fused by a multi-head prediction module that accounts for the inter-dependencies between noun, verb, and action labels in egocentric video datasets. EgoACO features built-in visual explanations that aid both learning and interpretation. Results on the two largest egocentric action recognition datasets currently available, EPIC-KITCHENS and EGTEA, show that by explicitly decoding action-context-object descriptors, EgoACO achieves state-of-the-art recognition performance.


