Understanding Masked Image Modeling via Learning Occlusion Invariant Feature

by Xiangwen Kong, et al.
Megvii Technology Limited

Masked Image Modeling (MIM) has recently achieved great success in self-supervised visual recognition. However, because MIM is a reconstruction-based framework, how it works remains an open question: it appears very different from previously well-studied siamese approaches such as contrastive learning. In this paper, we propose a new viewpoint: MIM implicitly learns occlusion-invariant features, which is analogous to other siamese methods, except that the latter learn other invariances. By relaxing the MIM formulation into an equivalent siamese form, MIM methods can be interpreted in a unified framework with conventional methods, in which only a) the data transformations, i.e. what invariance to learn, and b) the similarity measurements differ. Furthermore, taking MAE (He et al.) as a representative example of MIM, we empirically find that the success of MIM models depends little on the choice of similarity function and instead stems from the occlusion-invariant features learned from masked images; these turn out to be a favored initialization for vision transformers, even though the learned features may be less semantic. We hope our findings inspire researchers to develop more powerful self-supervised methods in the computer vision community.
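
To make the siamese reading of the abstract concrete, the following is a minimal sketch, not the paper's formulation or the MAE implementation. It assumes a toy linear encoder with mean pooling, zero-filling as the occlusion transform, and squared Euclidean distance as the similarity measurement; the names `random_patch_mask`, `encode`, and `siamese_occlusion_loss` are hypothetical and chosen only for illustration. One branch sees the occluded patches, the other sees the full image, and the loss penalizes the gap between their features, so driving it down pushes the encoder toward occlusion invariance.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array in which True marks a masked (occluded) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def encode(patches, weight):
    """Toy encoder (an assumption): per-patch linear projection, then mean pooling."""
    return (patches @ weight).mean(axis=0)


def siamese_occlusion_loss(patches, mask, weight):
    """Squared distance between the features of the occluded and the full view."""
    occluded = patches * (~mask)[:, None]  # zero out the masked patches
    z_occluded = encode(occluded, weight)  # branch 1: occluded input
    z_full = encode(patches, weight)       # branch 2: unmasked input
    return float(np.sum((z_occluded - z_full) ** 2))


rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))   # 16 patches, 8 values per patch
weight = rng.normal(size=(8, 4))     # projection to a 4-dimensional feature
mask = random_patch_mask(16, 0.75, rng)
loss = siamese_occlusion_loss(patches, mask, weight)
```

In this sketch, swapping the occlusion transform for another augmentation, or the squared distance for another similarity function, changes only the two components the abstract names as differing between methods.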




