Exploring the Diversity and Invariance in Yourself for Visual Pre-Training Task

06/01/2021
by Longhui Wei, et al.

Recently, self-supervised learning methods have achieved remarkable success in visual pre-training tasks. By simply pulling the different augmented views of each image together, or through other novel mechanisms, they can learn rich representations without labels and significantly improve the transfer performance of pre-trained models. However, these works still cannot avoid the representation collapse problem, i.e., they either focus on limited regions, or the features extracted from totally different regions inside each image are nearly identical. In general, this problem prevents pre-trained models from sufficiently describing the multi-grained information inside images, which further limits the upper bound of their transfer performance. To alleviate this issue, this paper introduces a simple but effective mechanism, called Exploring the Diversity and Invariance in Yourself (E-DIY). By pushing the most different regions inside each augmented view further apart, E-DIY preserves the diversity of the extracted region-level features. By pulling the most similar regions from different augmented views of the same image together, E-DIY ensures the robustness of region-level features. Benefiting from this diversity- and invariance-exploring mechanism, E-DIY maximally extracts the multi-grained visual information inside each image. Extensive experiments on downstream tasks demonstrate the superiority of our proposed approach, e.g., a 2.1-point improvement over the strong baseline BYOL on COCO when fine-tuning Mask R-CNN with the R50-C4 backbone and the 1x learning schedule.
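The two objectives described above (pushing the most different within-view regions further apart, pulling the most similar cross-view regions together) can be sketched with a toy region-level loss. This is a minimal illustration, not the paper's actual implementation: the function name, the cosine-similarity choice, and the temperature `tau` are assumptions for exposition.

```python
import numpy as np

def e_diy_region_loss(regions_a, regions_b, tau=0.2):
    """Toy sketch of a diversity + invariance region-level objective.

    regions_a, regions_b: (N, D) arrays of region features extracted
    from two augmented views of the same image.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = regions_a / np.linalg.norm(regions_a, axis=1, keepdims=True)
    b = regions_b / np.linalg.norm(regions_b, axis=1, keepdims=True)

    # Diversity: within view A, find each region's *most different*
    # region (lowest similarity, excluding itself) and push it further
    # away by minimizing that similarity.
    sim_aa = a @ a.T
    np.fill_diagonal(sim_aa, np.inf)  # exclude self-similarity from the min
    diversity = sim_aa.min(axis=1).mean()

    # Invariance: across views, find each region's *most similar*
    # counterpart and pull it closer by maximizing that similarity
    # (hence the negative sign: lower loss = higher similarity).
    sim_ab = a @ b.T
    invariance = -sim_ab.max(axis=1).mean()

    return (diversity + invariance) / tau
```

Minimizing this loss simultaneously decreases the similarity of the least similar within-view region pairs (diversity) and increases the similarity of the best-matching cross-view region pairs (invariance); when both views are identical, the invariance term saturates at -1 per region.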

