mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

03/29/2022
by   Xiaotong Li, et al.

Image BERT pre-training with masked image modeling (MIM) has become a popular approach to self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task over a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens using a pre-learned dVAE. Although this is a feasible solution, improper discretization hinders further improvements in image pre-training. Since image discretization has no ground-truth answers, we believe that a masked patch should not be assigned a unique token id even if a better tokenizer can be obtained. In this work, we introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks with eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for the masked image patches is formed by the soft probability vectors over the discrete token ids, which are predicted by the off-the-shelf image tokenizer and further refined by high-level inter-patch perceptions, based on the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method, e.g., the pre-trained ViT-B achieves 84.1% accuracy on ImageNet-1K classification and 51.2 on COCO segmentation, outperforming the competitive counterparts.
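The multi-choice supervision described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the `temperature` and `mix` parameters, the use of cosine similarity for the inter-patch affinity, and the linear blend of the two terms are all assumptions made for illustration; the abstract only states that the tokenizer's soft probability vectors are refined so that similar patches share their choices.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def softmax(x, axis=-1):
    return np.exp(log_softmax(x, axis=axis))

def multi_choice_targets(tokenizer_logits, patch_features,
                         temperature=1.0, mix=0.8):
    """Build soft targets over the visual vocabulary (hypothetical helper).

    tokenizer_logits: (N, V) tokenizer scores for N patches, V token ids.
    patch_features:   (N, D) high-level features of the same patches.
    temperature, mix: assumed hyperparameters, not taken from the source.
    """
    # Soft probability vector per patch instead of a single hard token id.
    soft = softmax(tokenizer_logits / temperature, axis=-1)

    # Inter-patch affinity from cosine similarity (an assumed choice).
    feats = patch_features / np.linalg.norm(patch_features, axis=-1, keepdims=True)
    affinity = softmax(feats @ feats.T, axis=-1)  # (N, N), each row sums to 1

    # Similar patches share their choices: propagate soft labels along the
    # affinity matrix, then blend with the tokenizer's own prediction.
    propagated = affinity @ soft
    return mix * soft + (1.0 - mix) * propagated

def soft_cross_entropy(pred_logits, targets, mask):
    """Cross-entropy against soft targets, averaged over masked patches only."""
    per_patch = -(targets * log_softmax(pred_logits, axis=-1)).sum(axis=-1)
    return per_patch[mask].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_patches, vocab_size, feat_dim = 6, 10, 4
    targets = multi_choice_targets(rng.normal(size=(n_patches, vocab_size)),
                                   rng.normal(size=(n_patches, feat_dim)))
    mask = np.array([True, False, True, False, False, True])
    loss = soft_cross_entropy(rng.normal(size=(n_patches, vocab_size)), targets, mask)
    print(targets.shape, float(loss))
```

Because both the tokenizer term and the propagated term are probability distributions, any blend with `mix` in [0, 1] is again a valid distribution, so the targets can be used directly in a soft cross-entropy loss over the masked positions.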


research
07/27/2022

Point-McBert: A Multi-choice Self-supervised Framework for Point Cloud Pre-training

Masked language modeling (MLM) has become one of the most successful sel...
research
03/08/2023

Centroid-centered Modeling for Efficient Vision Transformer Pre-training

Masked Image Modeling (MIM) is a new self-supervised vision pre-training...
research
06/10/2020

MC-BERT: Efficient Language Pre-Training via a Meta Controller

Pre-trained contextual representations (e.g., BERT) have become the foun...
research
04/03/2022

POS-BERT: Point Cloud One-Stage BERT Pre-Training

Recently, the pre-training paradigm combining Transformer and masked lan...
research
03/09/2023

Masked Image Modeling with Local Multi-Scale Reconstruction

Masked Image Modeling (MIM) achieves outstanding success in self-supervi...
research
01/30/2023

Advancing Radiograph Representation Learning with Masked Record Modeling

Modern studies in radiograph representation learning rely on either self...
research
10/13/2022

Exploring Long-Sequence Masked Autoencoders

Masked Autoencoding (MAE) has emerged as an effective approach for pre-t...
