On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition

09/15/2022
by Farrukh Rahman, et al.

Recently, vision transformers have been shown to be competitive with convolution-based methods (CNNs) across a broad range of vision tasks. The less restrictive inductive bias of transformers endows them with greater representational capacity than CNNs. In the image classification setting, however, this flexibility comes with a trade-off in sample efficiency: transformers require ImageNet-scale training data. This notion has carried over to video, where transformers have not yet been explored for classification in the low-labeled or semi-supervised settings. Our work empirically explores the low-data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and Something-Something V2) and perform thorough analysis and ablation studies to explain this observation in terms of the predominant features of video transformer architectures. We further show that, using only the labeled data, transformers significantly outperform complex semi-supervised CNN methods that additionally leverage large-scale unlabeled data. Our experiments inform our recommendation that future work on semi-supervised video learning should consider video transformers.
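One of the predominant features of video transformer architectures mentioned above is factorized space-time attention, in which spatial self-attention within each frame is followed by temporal self-attention across frames. The sketch below is a minimal, hypothetical illustration of that pattern in PyTorch; it is not the authors' implementation, and all module and variable names are our own.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Minimal sketch (our assumption, not the paper's code) of a
    factorized space-time attention block: spatial self-attention
    within each frame, then temporal self-attention across frames
    for each spatial patch location."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        # Spatial attention: fold the time axis into the batch axis,
        # so attention runs over the patches of each frame separately.
        xs = self.norm1(x).reshape(b * t, p, d)
        attn_s, _ = self.spatial_attn(xs, xs, xs)
        x = x + attn_s.reshape(b, t, p, d)
        # Temporal attention: fold the patch axis into the batch axis,
        # so attention runs over frames at each patch location.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * p, t, d)
        attn_t, _ = self.temporal_attn(xt, xt, xt)
        x = x + attn_t.reshape(b, p, t, d).permute(0, 2, 1, 3)
        return x

block = FactorizedSpaceTimeBlock(dim=64, num_heads=4)
video_tokens = torch.randn(2, 8, 49, 64)  # 2 clips, 8 frames, 7x7 patches
out = block(video_tokens)
print(out.shape)  # torch.Size([2, 8, 49, 64])
```

Compared with joint attention over all frames and patches at once, the factorized form reduces attention cost from O((t*p)^2) to O(t*p^2 + p*t^2), which is part of why such blocks are standard in video transformers.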


Related research

11/22/2021 - Semi-Supervised Vision Transformers
We study the training of Vision Transformers for semi-supervised image c...

08/11/2022 - Semi-supervised Vision Transformers at Scale
We study semi-supervised learning (SSL) for vision transformers (ViT), a...

02/29/2020 - VideoSSL: Semi-Supervised Learning for Video Classification
We propose a semi-supervised learning approach for video classification,...

04/02/2021 - LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
We design a family of image classification architectures that optimize t...

10/12/2021 - Trivial or impossible – dichotomous data difficulty masks model differences (on ImageNet and beyond)
"The power of a generalization system follows directly from its biases" ...

01/16/2022 - Video Transformers: A Survey
Transformer models have shown great success modeling long-range interact...

09/08/2022 - Video Vision Transformers for Violence Detection
Law enforcement and city safety are significantly impacted by detecting ...
