Evaluating Transformers for Lightweight Action Recognition

11/18/2021
by Raivo Koot, et al.

In video action recognition, transformers consistently reach state-of-the-art accuracy. However, many models are too heavyweight for the average researcher with limited hardware resources. In this work, we explore the limitations of video transformers for lightweight action recognition. We benchmark 13 video transformers and baselines across 3 large-scale datasets and 10 hardware devices. Our study is the first to evaluate the efficiency of action recognition models in depth across multiple devices and to train a wide range of video transformers under the same conditions. We categorize current methods into three classes and show that composite transformers that augment convolutional backbones are best at lightweight action recognition, despite falling short on accuracy. Meanwhile, attention-only models need more motion modeling capability, and stand-alone attention-block models currently incur too much latency overhead. Our experiments indicate that current video transformers are not yet capable of lightweight action recognition on par with traditional convolutional baselines, and that the shortcomings above need to be addressed to bridge this gap. Code to reproduce our experiments will be made publicly available.
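Since the paper's benchmark code is not yet public, the following is only a minimal sketch of how per-clip inference latency is commonly measured on a given device: warm-up passes, explicit device synchronization, and a median over repeated forward passes. The model choice (torchvision's r3d_18, a convolutional baseline) and the clip shape (8 frames at 224x224) are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: per-clip latency measurement for a video model.
# Assumptions (not from the paper): model = torchvision r3d_18,
# clip shape = (1, 3, 8, 224, 224), median over 50 timed runs.
import time
import torch
from torchvision.models.video import r3d_18

def measure_latency_ms(model, clip, device, warmup=10, runs=50):
    """Median per-clip forward latency in milliseconds on `device`."""
    model = model.to(device).eval()
    clip = clip.to(device)
    with torch.no_grad():
        for _ in range(warmup):           # warm up caches / cuDNN autotuning
            model(clip)
        if device.type == "cuda":
            torch.cuda.synchronize()      # CUDA kernels launch asynchronously
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            model(clip)
            if device.type == "cuda":
                torch.cuda.synchronize()  # wait for the GPU before stopping the clock
            times.append((time.perf_counter() - start) * 1000.0)
    times.sort()
    return times[len(times) // 2]         # median is robust to stragglers

if __name__ == "__main__":
    model = r3d_18(weights=None)
    # One clip: batch=1, 3 channels, 8 frames, 224x224 (assumed input shape).
    clip = torch.randn(1, 3, 8, 224, 224)
    for name in ["cpu"] + (["cuda"] if torch.cuda.is_available() else []):
        device = torch.device(name)
        print(f"{name}: {measure_latency_ms(model, clip, device):.1f} ms/clip")
```

Repeating this measurement per model and per device (the paper covers 10 devices) yields the kind of latency comparison the study uses to contrast transformer classes with convolutional baselines.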

