Motion-Guided Masking for Spatiotemporal Representation Learning

by David Fan, et al.

Several recent works have directly extended the image masked autoencoder (MAE) with random masking into the video domain, achieving promising results. However, unlike images, videos carry both spatial and temporal information that matters for understanding. This suggests that the random masking strategy inherited from the image MAE is less effective for video MAE, and it motivates the design of a novel masking algorithm that makes more efficient use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) that leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be obtained directly from information stored in the compressed video format, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to a +1.3% improvement over previous state-of-the-art methods. Additionally, our MGM matches the performance of previous video MAE with up to 66% fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to a +4.9% improvement over baseline methods.
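The core idea of motion-guided masking is that a mask placed on a patch in one frame should follow that patch as it moves in subsequent frames, rather than being re-sampled randomly per frame. The sketch below illustrates one plausible way to propagate a patch-level mask through time using per-patch motion vectors (e.g., pooled from codec motion vectors in the compressed stream). This is a minimal illustration under assumed array shapes, not the authors' implementation; `propagate_masks` and its signature are hypothetical.

```python
import numpy as np

def propagate_masks(init_mask, motion_vectors):
    """Shift masked-patch positions frame by frame along motion vectors.

    init_mask: (H, W) boolean array over the patch grid; True = masked
        patch in the first frame.
    motion_vectors: (T-1, H, W, 2) integer patch-level displacements
        (dy, dx), e.g. pooled from the video codec's motion vectors
        (shapes are an assumption for this sketch).
    Returns a (T, H, W) boolean mask volume.
    """
    H, W = init_mask.shape
    T = motion_vectors.shape[0] + 1
    masks = np.zeros((T, H, W), dtype=bool)
    masks[0] = init_mask
    for t in range(1, T):
        # Move each masked patch from frame t-1 by its motion vector.
        ys, xs = np.nonzero(masks[t - 1])
        dy = motion_vectors[t - 1, ys, xs, 0]
        dx = motion_vectors[t - 1, ys, xs, 1]
        ny = np.clip(ys + dy, 0, H - 1)  # stay inside the patch grid
        nx = np.clip(xs + dx, 0, W - 1)
        masks[t, ny, nx] = True
    return masks
```

Note that propagated positions can collide, so the number of masked patches may shrink over time; a real implementation would also need to top up or trim the mask to maintain a fixed masking ratio per frame, which this sketch omits.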




