Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition

04/01/2022
by Ryota Hashiguchi, et al.

We propose Multi-head Self/Cross-Attention (MSCA), which introduces a temporal cross-attention mechanism for action recognition based on the structure of the Multi-head Self-Attention (MSA) mechanism of the Vision Transformer (ViT). Simply applying ViT to each frame of a video captures per-frame features but cannot model temporal information, while explicitly modeling temporal information with a CNN or Transformer is computationally expensive. TSM, which performs feature shifting, assumes a CNN backbone and cannot take advantage of the ViT structure. The proposed model captures temporal information by shifting the Query, Key, and Value in the computation of MSA in ViT. This is efficient, incurring no additional computational cost, and is a natural structure for extending ViT along the temporal dimension. Experiments on Kinetics-400 show the effectiveness of the proposed method and its superiority over previous methods.
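To make the idea concrete, the PyTorch snippet below is a minimal sketch, not the authors' implementation: the function names (temporal_shift, msca), the choice to shift only the key and value projections, and the fraction of shifted channels (shift_div) are assumptions made for illustration, whereas the paper considers shifting the Query, Key, and Value. What it demonstrates is the core point of the abstract: the shift is a pure re-indexing of tensors along the frame axis, so queries of frame t can attend to content from frames t-1 and t+1 with no extra attention computation.

```python
import torch
import torch.nn.functional as F

def temporal_shift(x, shift_div=4):
    # x: (B, T, N, D) -- batch, frames, tokens per frame, channels.
    # Move the first D/shift_div channels one frame forward in time and the
    # next D/shift_div channels one frame backward; zero-pad at the ends.
    B, T, N, D = x.shape
    fold = D // shift_div
    out = x.clone()
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]                   # frame t sees t-1
    out[:, 0, :, :fold] = 0
    out[:, :-1, :, fold:2 * fold] = x[:, 1:, :, fold:2 * fold]   # frame t sees t+1
    out[:, -1, :, fold:2 * fold] = 0
    return out

def msca(x, w_q, w_k, w_v, num_heads=8):
    # Per-frame multi-head attention, but with K and V temporally shifted so
    # that part of each frame's attention becomes cross-frame attention.
    B, T, N, D = x.shape
    q = x @ w_q
    k = temporal_shift(x @ w_k)   # keys carry neighbouring-frame content
    v = temporal_shift(x @ w_v)   # values likewise

    def split(t):
        # (B, T, N, D) -> (B*T, heads, N, D/heads)
        return t.reshape(B * T, N, num_heads, D // num_heads).transpose(1, 2)

    q, k, v = split(q), split(k), split(v)
    attn = F.softmax(q @ k.transpose(-2, -1) / (D // num_heads) ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(B, T, N, D)
    return out

if __name__ == "__main__":
    x = torch.randn(2, 8, 197, 768)          # 2 clips, 8 frames, 197 ViT tokens
    w = lambda: torch.randn(768, 768) * 768 ** -0.5
    print(msca(x, w(), w(), w()).shape)      # torch.Size([2, 8, 197, 768])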
