CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

05/27/2023
by   Dachuan Shi, et al.

Vision-language models have achieved remarkable progress, but their computational costs and inference latency are growing rapidly alongside that progress, making model acceleration critical for researchers with limited resources and for consumers with low-end devices. Although extensively studied for unimodal models, acceleration for multimodal models, especially vision-language Transformers, remains relatively under-explored. Accordingly, this paper proposes Cross-Guided Ensemble of Tokens (CrossGET), a universal vision-language Transformer acceleration framework that adaptively reduces the number of tokens during inference via cross-modal guidance on the fly, yielding significant acceleration while maintaining high performance. Specifically, CrossGET has two key designs: 1) Cross-Guided Matching and Ensemble. CrossGET incorporates cross-modal guided token matching and ensemble to merge tokens effectively, introducing only cross-modal tokens with negligible extra parameters. 2) Complete-Graph Soft Matching. In contrast to the previous bipartite soft matching approach, CrossGET introduces an efficient and effective complete-graph soft matching policy that achieves more reliable token-matching results. Extensive experiments on various vision-language tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed CrossGET framework. The code will be at https://github.com/sdc17/CrossGET.
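To make the core idea concrete, the sketch below is an illustrative, greatly simplified take on similarity-based token merging over a complete graph of token pairs: at each step, the two most similar tokens are merged into a weighted average ("ensemble"). This is not the authors' implementation — CrossGET additionally uses cross-modal guidance and an efficient matching policy — and the function name and greedy loop are assumptions for illustration only.

```python
import numpy as np

def soft_match_merge(tokens, r):
    """Illustrative greedy token merging over the complete similarity graph.

    tokens: (n, d) array of token embeddings
    r: number of tokens to remove via merging
    Repeatedly merges the most cosine-similar pair; each merge is a
    count-weighted average so a merged token represents all of its
    constituent original tokens. Simplified sketch, not CrossGET itself.
    """
    toks = [t.astype(np.float64) for t in tokens]
    counts = [1] * len(toks)  # how many original tokens each entry represents
    for _ in range(r):
        # Cosine similarity over all pairs (complete graph, O(n^2) per step)
        normed = [t / (np.linalg.norm(t) + 1e-8) for t in toks]
        best, pair = -np.inf, None
        for i in range(len(toks)):
            for j in range(i + 1, len(toks)):
                s = float(normed[i] @ normed[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        # Ensemble: average weighted by how many tokens were merged into each
        merged = (toks[i] * counts[i] + toks[j] * counts[j]) / (counts[i] + counts[j])
        toks[i], counts[i] = merged, counts[i] + counts[j]
        del toks[j], counts[j]
    return np.stack(toks)
```

Reducing a sequence of n tokens to n - r tokens this way shortens every subsequent attention computation, which is the source of the speedup; the paper's contribution is choosing which pairs to merge more reliably (complete-graph soft matching) and guiding that choice with the other modality.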


