ATS: Adaptive Token Sampling For Efficient Vision Transformers

11/30/2021
by Mohsen Fayyaz, et al.

While state-of-the-art vision transformer models achieve promising results for image classification, they are computationally very expensive and require many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by reducing the number of tokens in the network, there is no setting that is optimal for all input images. In this work, we therefore introduce a differentiable parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture. ATS empowers vision transformers by scoring and adaptively sampling significant tokens. As a result, the number of tokens is no longer static but varies for each input image. By integrating ATS as an additional layer within current transformer blocks, we can convert them into much more efficient vision transformers with an adaptive number of tokens. Since ATS is a parameter-free module, it can be added to off-the-shelf pretrained vision transformers as a plug-and-play module, thus reducing their GFLOPs without any additional training. However, due to its differentiable design, one can also train a vision transformer equipped with ATS. We evaluate our module on the ImageNet dataset by adding it to multiple state-of-the-art vision transformers. Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by 37% while preserving the accuracy.
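To make the scoring-and-sampling idea concrete, here is a minimal PyTorch sketch of attention-based adaptive token sampling, not the paper's exact implementation: it scores each non-CLS token by the CLS token's attention to it and selects a variable-size subset via inverse-transform sampling over the score CDF. The function name adaptive_token_sampling and all shapes are illustrative assumptions; the published method additionally weights the attention by the norm of the value vectors, which is omitted here for brevity.

import torch

def adaptive_token_sampling(tokens, attn, n_max):
    """Hypothetical sketch: keep a variable number of significant tokens.

    tokens: (B, N, D) token embeddings; tokens[:, 0] is the CLS token.
    attn:   (B, N, N) post-softmax attention, averaged over heads.
    n_max:  upper bound on the number of non-CLS tokens kept.
    """
    # Significance score of each non-CLS token: how much attention the
    # CLS token pays to it, normalized to a probability distribution.
    scores = attn[:, 0, 1:]
    scores = scores / scores.sum(dim=1, keepdim=True)        # (B, N-1)

    # Inverse-transform sampling: tokens with more attention mass cover
    # wider intervals of the CDF, so they are hit by more of the evenly
    # spaced samples in [0, 1].
    cdf = torch.cumsum(scores, dim=1)                         # (B, N-1)
    u = torch.linspace(0.0, 1.0, n_max, device=tokens.device)
    idx = torch.searchsorted(
        cdf.contiguous(), u.expand(cdf.size(0), -1).contiguous()
    )
    idx = idx.clamp(max=scores.size(1) - 1)

    # Duplicate hits collapse into a single kept token, which is what
    # makes the number of sampled tokens adaptive per image.
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep.scatter_(1, idx, True)

    # Gather the CLS token plus the sampled tokens for each image.
    return [
        torch.cat([tokens[b, :1], tokens[b, 1:][keep[b]]], dim=0)
        for b in range(tokens.size(0))
    ]

# Toy usage with random data (shapes mimic a ViT-B/16 on 224x224 input).
B, N, D = 2, 197, 768
tokens = torch.randn(B, N, D)
attn = torch.softmax(torch.randn(B, N, N), dim=-1)
kept = adaptive_token_sampling(tokens, attn, n_max=64)
print([t.shape[0] for t in kept])  # per-image token counts, at most 65

Because sampling only selects among existing tokens and adds no learned weights, a block like this can in principle be dropped into a pretrained transformer without retraining, which is the plug-and-play property the abstract describes.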

Related research

12/14/2021
AdaViT: Adaptive Tokens for Efficient Vision Transformer
We introduce AdaViT, a method that adaptively adjusts the inference cost...

06/19/2023
RaViTT: Random Vision Transformer Tokens
Vision Transformers (ViTs) have successfully been applied to image class...

09/11/2023
SparseSwin: Swin Transformer with Sparse Transformer Block
Advancements in computer vision research have put transformer architectu...

12/21/2021
MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation
ViTs are often too computationally expensive to be fitted onto real-worl...

07/05/2023
MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers
The input tokens to Vision Transformers carry little semantic meaning as...

11/14/2022
CabViT: Cross Attention among Blocks for Vision Transformer
Since the vision transformer (ViT) has achieved impressive performance i...

03/08/2022
Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4
Modern high-scoring models of vision in the brain score competition do n...
