Green Hierarchical Vision Transformer for Masked Image Modeling

05/26/2022
by Lang Huang, et al.

We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer, allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy: to mitigate the quadratic complexity of self-attention w.r.t. the number of patches, the visible patches within local windows of arbitrary size are partitioned into groups of equal size, and masked self-attention is then performed within each group. Second, we further improve the grouping strategy via a Dynamic Programming algorithm that minimizes the overall computation cost of the attention on the grouped patches. As a result, MIM can now work on hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs about 2.7× faster and reduce GPU memory usage by 70%, while still achieving competitive performance on ImageNet classification and superior results on the downstream COCO object detection benchmark. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.
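The grouping and masked attention described above can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' released implementation: the function names, the greedy packing of windows into groups, and the brute-force stand-in for the Dynamic Programming step are assumptions made here for clarity.

```python
# Minimal sketch (assumed names and logic, not the official GreenMIM code):
# pack the visible patches of local windows into equal-budget groups and run
# masked self-attention so that patches only attend within their own window.
import torch
import torch.nn.functional as F

def group_windows(counts, group_size):
    """Greedily pack windows, given their visible-patch counts, into groups
    whose total patch count stays within `group_size` (a single oversized
    window simply forms its own group). Returns lists of window indices."""
    groups, current, used = [], [], 0
    for w, c in enumerate(counts):
        if current and used + c > group_size:
            groups.append(current)
            current, used = [], 0
        current.append(w)
        used += c
    if current:
        groups.append(current)
    return groups

def best_group_size(counts, candidates):
    """Brute-force stand-in for the paper's Dynamic Programming step:
    pick the candidate group size minimizing the total quadratic attention cost."""
    def cost(gs):
        groups = group_windows(counts, gs)
        return sum(sum(counts[w] for w in g) ** 2 for g in groups)
    return min(candidates, key=cost)

def masked_group_attention(x, window_id):
    """Plain self-attention over one group of visible patches.
    x:         (N, C) patch features in the group
    window_id: (N,)   index of the local window each patch came from
    The mask restricts attention to patches from the same window."""
    attn = (x @ x.t()) / x.shape[-1] ** 0.5
    same_window = window_id[:, None] == window_id[None, :]
    attn = attn.masked_fill(~same_window, float("-inf"))
    return F.softmax(attn, dim=-1) @ x
```

For example, with per-window visible counts such as [3, 5, 2, 7] and candidate budgets (4, 8, 16), best_group_size returns the budget with the lowest total quadratic cost, and masked_group_attention is then applied to each packed group independently.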

