Make A Long Image Short: Adaptive Token Length for Vision Transformers

by Yichen Zhu, et al.

The vision transformer (ViT) splits each image into a fixed-length sequence of tokens and processes them in the same way as words in natural language processing. More tokens normally lead to better performance but at considerably higher computational cost. Motivated by the proverb "A picture is worth a thousand words," we aim to accelerate the ViT model by making a long image short. To this end, we propose a novel approach that assigns token length adaptively during inference. Specifically, we first train a ViT model, called Resizable-ViT (ReViT), that can process any given input with diverse token lengths. Then, we retrieve the "token-length label" from ReViT and use it to train a lightweight Token-Length Assigner (TLA). The token-length label of an image is the smallest number of tokens into which the image can be split such that ReViT still makes the correct prediction, and the TLA learns to allocate the optimal token length based on these labels. The TLA enables ReViT to process each image with the minimum sufficient number of tokens during inference, so inference is accelerated by reducing the number of tokens in the ViT model. Our approach is general, compatible with modern vision transformer architectures, and can significantly reduce computational expense. We verified the effectiveness of our method on multiple representative ViT models (DeiT, LV-ViT, and TimeSformer) across two tasks (image classification and action recognition).
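The token-length label described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `token_length_label`, the `predict` callable standing in for a trained ReViT, the candidate lengths, and the choice to return `None` when no length yields a correct prediction are all assumptions made for illustration.

```python
from typing import Any, Callable, Optional, Sequence


def token_length_label(
    predict: Callable[[Any, int], int],
    image: Any,
    target: int,
    candidate_lengths: Sequence[int],
) -> Optional[int]:
    """Return the smallest candidate token length at which `predict` is correct.

    `predict(image, n)` is a hypothetical stand-in for running a resizable
    ViT on `image` split into `n` tokens and returning the predicted class.
    Returns None if no candidate length gives the correct class (how such
    images are labeled is an assumption here, not taken from the abstract).
    """
    # Try the cheapest (shortest) token length first.
    for n in sorted(candidate_lengths):
        if predict(image, n) == target:
            return n
    return None


# Toy stand-in model: it is "correct" only when given enough tokens.
def toy_predict(image: dict, n: int) -> int:
    return image["label"] if n >= image["min_tokens"] else -1


# Illustrative candidate lengths (7x7, 10x10, and 14x14 patch grids).
lengths = (49, 100, 196)
easy = {"label": 3, "min_tokens": 40}
hard = {"label": 5, "min_tokens": 150}

easy_label = token_length_label(toy_predict, easy, 3, lengths)
hard_label = token_length_label(toy_predict, hard, 5, lengths)
```

Under these assumptions, labels collected this way over a training set would serve as classification targets for a lightweight assigner such as the TLA, which at inference time picks a token length before the main model runs.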




