VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

by   Limin Wang, et al.

Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale the VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to high masking ratio in encoder, masking decoder can still further reduce the overall computational cost. This enables the efficient pre-training of billion-level models in video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by a post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0 89.9 addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating its effectiveness as a general video representation learner.


page 1

page 2

page 3

page 4


Masked Autoencoders Are Scalable Vision Learners

This paper shows that masked autoencoders (MAE) are scalable self-superv...

SEPT: Towards Scalable and Efficient Visual Pre-Training

Recently, the self-supervised pre-training paradigm has shown great pote...

EDMAE: An Efficient Decoupled Masked Autoencoder for Standard View Identification in Pediatric Echocardiography

We propose an efficient decoupled mask autoencoder (EDMAE) for standard ...

MGMAE: Motion Guided Masking for Video Masked Autoencoding

Masked autoencoding has shown excellent performance on self-supervised v...

FactorMatte: Redefining Video Matting for Re-Composition Tasks

We propose "factor matting", an alternative formulation of the video mat...

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Video Foundation Models (VFMs) have received limited exploration due to ...

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Pre-training video transformers on extra large-scale datasets is general...

Please sign up or login with your details

Forgot password? Click here to reset