BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning

by Zhi Hou et al.

Attention mechanisms have become very popular in deep neural networks, and the Transformer architecture has achieved great success not only in natural language processing but also in visual recognition. Recently, a new Transformer module applied along the batch dimension rather than the spatial/channel dimensions, i.e., BatchFormer [18], was introduced to explore sample relationships and overcome data-scarcity challenges. However, it works only with image-level representations for classification. In this paper, we devise a more general batch Transformer module, BatchFormerV2, which further enables exploring sample relationships for dense representation learning. Specifically, BatchFormerV2 employs a two-stream pipeline during training, i.e., one stream with and one without the BatchFormerV2 module, where the batchformer stream can be removed for testing. The proposed method is therefore a plug-and-play module that can be easily integrated into different vision Transformers without any extra inference cost. Without bells and whistles, we show the effectiveness of the proposed method on a variety of popular visual recognition tasks, including image classification and two important dense prediction tasks: object detection and panoptic segmentation. In particular, BatchFormerV2 consistently improves current DETR-based detection methods (e.g., DETR, Deformable-DETR, Conditional DETR, and SMCA) by over 1.3%. Code will be made publicly available.
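The core idea described above can be sketched in a few lines of PyTorch. The snippet below is a minimal, illustrative interpretation, not the authors' implementation: a standard Transformer encoder layer is applied along the batch dimension of a dense feature map (each spatial location attends to the same location in the other samples of the mini-batch), and a hypothetical `two_stream_forward` helper shows the shared-head training scheme in which the batchformer stream is simply dropped at inference time. All names and hyperparameters here are assumptions for illustration.

```python
import torch
import torch.nn as nn


class BatchFormerV2Sketch(nn.Module):
    """Illustrative batch-dimension Transformer for dense features.

    For an input feature map of shape (B, C, H, W), each spatial
    position is treated as an independent "sequence" of length B,
    so attention mixes information across samples, not across space.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # seq-first layout: input to the encoder is (seq, batch, feature)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads,
            dim_feedforward=dim * 2, batch_first=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # (B, C, H, W) -> (B, H*W, C): sequence dim = B, "batch" dim = H*W
        t = x.flatten(2).permute(0, 2, 1)
        t = self.encoder(t)  # attention across the B samples
        # back to (B, C, H, W)
        return t.permute(0, 2, 1).reshape(b, c, h, w)


def two_stream_forward(backbone, batchformer, head, images):
    """Hypothetical training-time two-stream pass with a shared head."""
    feat = backbone(images)
    plain = head(feat)               # stream without BatchFormerV2
    mixed = head(batchformer(feat))  # stream with BatchFormerV2
    # At inference, only `plain` is computed, so there is no extra cost.
    return plain, mixed
```

Because the prediction head is shared between the two streams, the plain stream learns representations compatible with the batch-mixed ones, which is what allows the module to be removed at test time.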




Related work:

- BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning
- Less is More: Pay Less Attention in Vision Transformers
- Rethinking Batch Sample Relationships for Data Representation: A Batch-Graph Transformer based Approach
- DPT: Deformable Patch-based Transformer for Visual Recognition
- Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning
- Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective
- Spatial Cross-Attention Improves Self-Supervised Visual Representation Learning
