UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

by   Kunchang Li, et al.

Learning discriminative spatiotemporal representation is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependency with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format. However, this model has to require a tiresome and complicated image-pretraining phrase, before being finetuned on videos. This blocks its wide usage in practice. On the contrary, open-sourced ViTs are readily available and well-pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks, by arming the pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block. But it contains brand-new local and global relation aggregators, which allow for preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 gets the state-of-the-art recognition performance on 8 popular video benchmarks, including scene-related Kinetics-400/600/700 and Moments in Time, temporal-related Something-Something V1/V2, untrimmed ActivityNet and HACS. In particular, it is the first model to achieve 90 available at https://github.com/OpenGVLab/UniFormerV2.


Uniformer: Unified Transformer for Efficient Spatiotemporal Representation Learning

It is a challenging task to learn rich and multi-scale spatiotemporal se...

UniFormer: Unifying Convolution and Self-attention for Visual Recognition

It is a challenging task to learn discriminative representation from ima...

Is Space-Time Attention All You Need for Video Understanding?

We present a convolution-free approach to video classification built exc...

FasterViT: Fast Vision Transformers with Hierarchical Attention

We design a new family of hybrid CNN-ViT neural networks, named FasterVi...

TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization

Temporal Action Localization (TAL) is a challenging task in video unders...

EchoCoTr: Estimation of the Left Ventricular Ejection Fraction from Spatiotemporal Echocardiography

Learning spatiotemporal features is an important task for efficient vide...

Self-supervised Video Transformer

In this paper, we propose self-supervised training for video transformer...

Please sign up or login with your details

Forgot password? Click here to reset