Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

by Ze Liu, et al.

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at <https://github.com/microsoft/Swin-Transformer>.
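The windowing idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names (`window_partition`, `cyclic_shift`, `attention_pairs`), the toy 4x4 grid, and the chosen window and shift sizes are all assumptions made for illustration, and the attention masking a real shifted-window layer would need for wrapped-around tokens is omitted. The sketch shows how non-overlapping windows are formed, how shifting the grid lets a window span several of the previous windows, and why the number of query-key pairs grows linearly with the number of tokens when the window size is fixed.

```python
def window_partition(grid, m):
    """Split an H x W grid (list of rows) into non-overlapping m x m windows."""
    h, w = len(grid), len(grid[0])
    assert h % m == 0 and w % m == 0, "grid sides must be divisible by m"
    windows = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            windows.append([grid[r][left:left + m] for r in range(top, top + m)])
    return windows


def cyclic_shift(grid, s):
    """Roll the grid up and to the left by s positions, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + s) % h][(c + s) % w] for c in range(w)] for r in range(h)]


def attention_pairs(h, w, m=None):
    """Count query-key pairs: global attention if m is None, else m x m windows."""
    n = h * w
    if m is None:
        return n * n          # every token attends to every token
    return n * m * m          # every token attends only within its window


# Toy example: a 4 x 4 grid of token ids, window size 2 (illustrative values).
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, 2)
shifted = window_partition(cyclic_shift(grid, 1), 2)

print(regular[0])  # tokens grouped by the regular partition
print(shifted[0])  # tokens grouped after the shift

# Doubling each image side quadruples the token count.
print(attention_pairs(56, 56), attention_pairs(112, 112))        # global
print(attention_pairs(56, 56, 7), attention_pairs(112, 112, 7))  # windowed
```

In the toy grid, the first window of the shifted partition contains one token from each of the four regular windows, which is the cross-window connection the abstract refers to. With a fixed window size, quadrupling the token count quadruples the windowed pair count, whereas the global pair count grows sixteenfold.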




