Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition

by Wangmeng Xiang, et al.

Transformer-based methods have recently made great advances in 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers to video data brings heavy computation and memory burdens, because the number of patches grows substantially and self-attention has quadratic complexity in the number of patches. Modeling the 3D self-attention of video data efficiently and effectively has therefore been a major challenge for transformers. In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition. TPS shifts a subset of patches along the temporal dimension following a specific mosaic pattern, thereby converting a vanilla spatial self-attention operation into a spatiotemporal one at little additional cost. As a result, we can compute 3D self-attention with nearly the same computation and memory cost as 2D self-attention. TPS is a plug-and-play module that can be inserted into existing 2D transformer models to enhance spatiotemporal feature learning. The proposed method achieves performance competitive with state-of-the-art methods on Something-Something V1 and V2, Diving-48, and Kinetics400 while being much more efficient in computation and memory cost. The source code of TPS can be found at
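To make the shifting idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes patch tokens arranged as a `(T, H, W, C)` grid, a small integer offset pattern tiled over the spatial grid, and wrap-around at the clip boundaries. The function name `temporal_patch_shift`, the 2x2 pattern, and the wrap-around behavior are all illustrative assumptions; the abstract does not specify the actual mosaic pattern or boundary handling.

```python
import numpy as np


def temporal_patch_shift(tokens: np.ndarray, pattern: np.ndarray) -> np.ndarray:
    """Shift selected patches along time (hypothetical helper, illustrative only).

    tokens:  array of shape (T, H, W, C), one C-dim token per patch per frame.
    pattern: small 2D integer array of temporal offsets, tiled over (H, W).
             Offset d at a position means frame t receives the patch that
             was at frame t - d (wrapping around the clip, an assumption).
    """
    T, H, W, C = tokens.shape
    ph, pw = pattern.shape
    # Tile the small pattern so it covers the whole patch grid, then crop.
    reps = (-(-H // ph), -(-W // pw))  # ceiling division
    offsets = np.tile(pattern, reps)[:H, :W]

    shifted = tokens.copy()
    for d in np.unique(offsets):
        if d == 0:
            continue  # these patches stay in their original frame
        mask = offsets == d
        # np.roll moves frame t - d into slot t along the time axis.
        rolled = np.roll(tokens, shift=int(d), axis=0)
        shifted[:, mask] = rolled[:, mask]
    return shifted


# Toy clip: every token stores the index of the frame it came from.
T, H, W, C = 4, 4, 4, 1
tokens = np.broadcast_to(
    np.arange(T, dtype=np.int64)[:, None, None, None], (T, H, W, C)
).copy()
pattern = np.array([[0, 1], [-1, 0]])  # assumed 2x2 mosaic, not from the paper
shifted = temporal_patch_shift(tokens, pattern)
print(shifted[2, :2, :2, 0])  # source frame of each patch in frame 2
```

After the shift, each frame's patch grid mixes patches from neighboring frames, so an unmodified per-frame (2D) self-attention layer applied to `shifted[t]` attends across time without any increase in the number of tokens per attention call. Applying the same function with `-pattern` moves the patches back to their original frames.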




Video Swin Transformer

The vision community is witnessing a modeling shift from CNNs to Transfo...

Efficient Attention-free Video Shift Transformers

This paper tackles the problem of efficient video recognition. In this a...

Knowledge Fusion Transformers for Video Action Recognition

We introduce Knowledge Fusion Transformers for video action classificati...

Is Space-Time Attention All You Need for Video Understanding?

We present a convolution-free approach to video classification built exc...

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

In video transformers, the time dimension is often treated in the same w...

MDAESF: Cine MRI Reconstruction Based on Motion-Guided Deformable Alignment and Efficient Spatiotemporal Self-Attention Fusion

Cine MRI can jointly obtain the continuous influence of the anatomical s...

Life Regression based Patch Slimming for Vision Transformers

Vision transformers have achieved remarkable success in computer vision ...
