SEPT: Towards Scalable and Efficient Visual Pre-Training

12/11/2022
by Yiqi Lin, et al.

Recently, the self-supervised pre-training paradigm has shown great potential in leveraging large-scale unlabeled data to improve downstream task performance. However, increasing the scale of unlabeled pre-training data in real-world scenarios requires prohibitive computational costs and faces the challenge of uncurated samples. To address these issues, we build a task-specific self-supervised pre-training framework from a data selection perspective, based on the simple hypothesis that pre-training on unlabeled samples whose distribution is similar to that of the target task brings substantial performance gains. Building on this hypothesis, we propose SEPT, a novel framework for Scalable and Efficient visual Pre-Training that introduces a retrieval pipeline for data selection. SEPT first leverages a self-supervised pre-trained model to extract features of the entire unlabeled dataset and initialize the retrieval pipeline. Then, for a specific target task, SEPT retrieves, for each target instance, the most similar samples from the unlabeled dataset based on feature similarity and uses them for pre-training. Finally, SEPT pre-trains the target model on the selected unlabeled samples in a self-supervised manner before fine-tuning on the target data. By decoupling the scale of pre-training from the upstream data available for a target task, SEPT achieves high scalability of the upstream dataset and high efficiency of pre-training, along with high flexibility in model architecture. Results on various downstream tasks demonstrate that SEPT achieves competitive or even better performance than ImageNet pre-training while reducing the number of training samples by one order of magnitude, without resorting to any extra annotations.
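The retrieval-based data selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function and array names (select_pretraining_subset, unlabeled_feats, target_feats), the choice of cosine similarity, the top-k value, and the union over per-instance retrievals are assumptions for illustration, and the sketch presumes features have already been extracted with a frozen self-supervised encoder.

```python
# Minimal sketch of SEPT-style retrieval-based data selection (illustrative, not the authors' code).
# Assumes `unlabeled_feats` (N x D) and `target_feats` (M x D) were produced by a frozen
# self-supervised encoder; both names are hypothetical.
import numpy as np


def select_pretraining_subset(unlabeled_feats: np.ndarray,
                              target_feats: np.ndarray,
                              k: int = 50) -> np.ndarray:
    """Return indices of unlabeled samples selected for pre-training.

    For each target instance, retrieve its k most similar unlabeled samples
    by cosine similarity, then take the union over all target instances.
    """
    # L2-normalize so that dot products equal cosine similarities.
    u = unlabeled_feats / (np.linalg.norm(unlabeled_feats, axis=1, keepdims=True) + 1e-12)
    t = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + 1e-12)

    selected = set()
    for query in t:
        sims = u @ query                       # (N,) cosine similarity to this target instance
        topk = np.argpartition(-sims, k)[:k]   # indices of the k most similar unlabeled samples
        selected.update(topk.tolist())
    return np.fromiter(selected, dtype=np.int64)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    unlabeled = rng.normal(size=(10_000, 128)).astype(np.float32)  # stand-in for pool features
    target = rng.normal(size=(100, 128)).astype(np.float32)        # stand-in for target features
    subset = select_pretraining_subset(unlabeled, target, k=50)
    print(f"selected {subset.size} of {unlabeled.shape[0]} unlabeled samples")
```

The selected subset would then be used for self-supervised pre-training of the target model, followed by fine-tuning on the labeled target data; in practice an approximate nearest-neighbor index would replace the brute-force similarity search for large pools.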

Related research

12/20/2021 - Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
Pre-training models on large scale datasets, like ImageNet, is a standar...

04/25/2021 - How Well Self-Supervised Pre-Training Performs with Streaming Data?
The common self-supervised pre-training practice requires collecting mas...

04/09/2023 - Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Learning with large-scale unlabeled data has become a powerful tool for ...

04/03/2023 - Disentangled Pre-training for Image Matting
Image matting requires high-quality pixel-level human annotations to sup...

03/27/2023 - Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval
Recent multilingual pre-trained models have shown better performance in ...

07/17/2023 - Revisiting Scene Text Recognition: A Data Perspective
This paper aims to re-assess scene text recognition (STR) from a data-or...

03/29/2023 - VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Scale is the primary factor for building a powerful foundation model tha...
