FLSL: Feature-level Self-supervised Learning

by   Qing Su, et al.

Current self-supervised learning (SSL) methods (e.g., SimCLR, DINO, VICReg, MOCOv3) target primarily on representations at instance level and do not generalize well to dense prediction tasks, such as object detection and segmentation. Towards aligning SSL with dense predictions, this paper demonstrates for the first time the underlying mean-shift clustering process of Vision Transformers (ViT), which aligns well with natural image semantics (e.g., a world of objects and stuffs). By employing transformer for joint embedding and clustering, we propose a two-level feature clustering SSL method, coined Feature-Level Self-supervised Learning (FLSL). We present the formal definition of the FLSL problem and construct the objectives from the mean-shift and k-means perspectives. We show that FLSL promotes remarkable semantic cluster representations and learns an embedding scheme amenable to intra-view and inter-view feature clustering. Experiments show that FLSL yields significant improvements in dense prediction tasks, achieving 44.9 (+2.8) and 46.5 instance segmentation on MS-COCO, using Mask R-CNN with ViT-S/16 and ViT-S/8 as backbone, respectively. FLSL consistently outperforms existing SSL methods across additional benchmarks, including UAV object detection on UAVDT, and video instance segmentation on DAVIS 2017. We conclude by presenting visualization and various ablation studies to better 20 understand the success of FLSL.


page 8

page 21


Self-Supervised Learning with Swin Transformers

We are witnessing a modeling shift from CNN to Transformers in computer ...

Object-Aware Cropping for Self-Supervised Learning

A core component of the recent success of self-supervised learning is cr...

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

This paper demonstrates an approach for learning highly semantic image r...

Self-Supervised Learning of Object Parts for Semantic Segmentation

Progress in self-supervised learning has brought strong general image re...

Mean Shift Mask Transformer for Unseen Object Instance Segmentation

Segmenting unseen objects is a critical task in many different domains. ...

Learning Where to Learn in Cross-View Self-Supervised Learning

Self-supervised learning (SSL) has made enormous progress and largely na...

Please sign up or login with your details

Forgot password? Click here to reset