Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition

01/25/2022
by   Kiyoon Kim, et al.
0

We address the problem of capturing temporal information for video classification in 2D networks, without increasing computational cost. Existing approaches focus on modifying the architecture of 2D networks (e.g. by including filters in the temporal dimension to turn them into 3D networks, or using optical flow, etc.), which increases computation cost. Instead, we propose a novel sampling strategy, where we re-order the channels of the input video, to capture short-term frame-to-frame changes. We observe that without bells and whistles, the proposed sampling strategy improves performance on multiple architectures (e.g. TSN, TRN, and TSM) and datasets (CATER, Something-Something-V1 and V2), up to 24 standard video input. In addition, our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing. Given the generality of the results and the flexibility of the approach, we hope this can be widely useful to the video understanding community. Code is available at https://github.com/kiyoon/PyVideoAI.

READ FULL TEXT
research
03/25/2021

An Image is Worth 16x16 Words, What is a Video Worth?

Leading methods in the domain of action recognition try to distill infor...
research
07/19/2019

Only Time Can Tell: Discovering Temporal Data for Temporal Modeling

Understanding temporal information and how the visual world changes over...
research
01/13/2023

Analyzing and Improving the Pyramidal Predictive Network for Future Video Frame Prediction

The pyramidal predictive network (PPNV1) proposes an interesting tempora...
research
06/10/2019

FASTER Recurrent Networks for Video Classification

Video classification methods often divide the video into short clips, do...
research
06/07/2022

Revealing Single Frame Bias for Video-and-Language Learning

Training an effective video-and-language model intuitively requires mult...
research
05/14/2020

TAM: Temporal Adaptive Module for Video Recognition

Temporal modeling is crucial for capturing spatiotemporal structure in v...
research
07/21/2022

Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation

The essence of video semantic segmentation (VSS) is how to leverage temp...

Please sign up or login with your details

Forgot password? Click here to reset