A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

by Wuyang Chen, et al.

This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers have recently demonstrated competitive performance on image classification tasks. To adapt ViT to object detection and other dense prediction tasks, many works inherit the multistage design of convolutional networks and build highly customized ViT architectures. The goal behind this design is a better trade-off between computational cost and effective aggregation of multiscale global contexts. However, existing works adopt the multistage architecture as a black-box solution, without a clear understanding of its true benefits. In this paper, we comprehensively study three architectural design choices for ViT (spatial reduction, doubled channels, and multiscale features) and demonstrate that a vanilla ViT architecture can achieve this trade-off without handcrafted multiscale features, preserving the original ViT design philosophy. We further derive a scaling rule to optimize our model's trade-off between accuracy and computational cost / model size. By keeping the feature resolution and hidden size constant throughout the encoder blocks, we propose a simple and compact ViT architecture, the Universal Vision Transformer (UViT), that achieves strong performance on COCO object detection and instance segmentation.
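To make the "constant feature resolution and hidden size" idea concrete, here is a minimal NumPy sketch of a single-scale transformer encoder in the spirit the abstract describes: every block operates on the same token grid and the same channel width, with no spatial reduction, channel doubling, or multiscale pyramid. All sizes, names, and the single-head attention here are illustrative assumptions for exposition, not the paper's actual UViT configuration (the weights are random and untrained).

```python
import numpy as np

# Illustrative single-scale ViT-style encoder (hypothetical sketch, not the
# paper's exact architecture). Every block keeps the SAME token count and the
# SAME hidden size, so the output is one single-scale feature map.

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(x, wq, wk, wv, wo):
    # Single-head global self-attention over all tokens (a simplification).
    q, k, v = x @ wq, x @ wk, x @ wv
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (a @ v) @ wo

def block(x, p):
    # Pre-norm transformer block: input and output shapes are identical,
    # i.e. no spatial reduction and no channel doubling between stages.
    x = x + attention(layer_norm(x), *p["attn"])
    h = np.maximum(layer_norm(x) @ p["w1"], 0.0)  # ReLU MLP for simplicity
    return x + h @ p["w2"]

def make_params(d, mlp_ratio=4):
    return {
        "attn": [rng.normal(0, 0.02, (d, d)) for _ in range(4)],
        "w1": rng.normal(0, 0.02, (d, mlp_ratio * d)),
        "w2": rng.normal(0, 0.02, (mlp_ratio * d, d)),
    }

def single_scale_encoder(img, patch=16, d=192, depth=4):
    # Patchify: (H, W, 3) -> (num_tokens, patch*patch*3), then project to d dims.
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    t = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    t = t.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    x = t @ rng.normal(0, 0.02, (patch * patch * C, d))
    for _ in range(depth):
        x = block(x, make_params(d))  # same (gh*gw, d) shape at every depth
    # Reshape back to a 2D grid: a single-scale feature map for detection heads.
    return x.reshape(gh, gw, d)

feat = single_scale_encoder(rng.normal(size=(64, 64, 3)))
print(feat.shape)  # (4, 4, 192): one scale, one width, regardless of depth
```

The contrast with a multistage backbone is that a hierarchical design would halve the grid and double `d` between stages; here the detection head consumes one feature map at a single scale.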


