Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition

10/22/2020
by Chun-Fu Chen, et al.

In recent years, a number of approaches based on 2D CNNs and 3D CNNs have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress they have made. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides common ground for a fair comparison. We then conduct a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap has been made in efficiency for action recognition, but not in accuracy; b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation abilities and transferability. Our analysis also shows that recent action models seem able to learn data-dependent temporality flexibly as needed. Our code and models are available at https://github.com/IBM/action-recognition-pytorch.

