Gate-Shift-Fuse for Video Action Recognition

by   Swathikiran Sudhakaran, et al.

Convolutional Neural Networks are the de facto models for image recognition. However 3D CNNs, the straight forward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is the increased computational complexity requiring large scale annotated datasets to train them in scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs. Existing kernel factorization approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data dependent manner. GSF leverages grouped spatial gating to decompose input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into an efficient and high performing spatio-temporal feature extractor, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks. Code and models will be made publicly available at


page 1

page 5

page 10

page 14


Gate-Shift Networks for Video Action Recognition

Deep 3D CNNs for video action recognition are designed to learn powerful...

SAIC_Cambridge-HuPBA-FBK Submission to the EPIC-Kitchens-100 Action Recognition Challenge 2021

This report presents the technical details of our submission to the EPIC...

Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

Convolutional neural networks with spatio-temporal 3D kernels (3D CNNs) ...

Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition

In recent years, a number of approaches based on 2D CNNs and 3D CNNs hav...

An Image Classifier Can Suffice For Video Understanding

We propose a new perspective on video understanding by casting the video...

Gradient Weighted Superpixels for Interpretability in CNNs

As Convolutional Neural Networks embed themselves into our everyday live...

Depthwise Spatio-Temporal STFT Convolutional Neural Networks for Human Action Recognition

Conventional 3D convolutional neural networks (CNNs) are computationally...

Please sign up or login with your details

Forgot password? Click here to reset