A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking

09/05/2023
by   Lorenzo Papa, et al.

Vision Transformer (ViT) architectures are becoming increasingly popular and are widely employed in computer vision applications. Their main feature is the capacity to extract global information through the self-attention mechanism, which lets them outperform earlier convolutional neural networks. However, the cost of deploying ViTs has grown steadily with their size, number of trainable parameters, and operations; furthermore, the computational and memory cost of self-attention increases quadratically with image resolution. These architectures are therefore difficult to employ in real-world applications, which impose hardware and environmental restrictions such as limited processing and memory capabilities. This survey investigates the most effective methodologies for retaining near-optimal estimation performance under such constraints. In more detail, four categories of efficiency strategies are analyzed: compact architectures, pruning, knowledge distillation, and quantization. Moreover, a new metric called the Efficient Error Rate (EER) is introduced to normalize and compare the model features that affect hardware devices at inference time, such as the number of parameters, bits, FLOPs, and model size. In summary, this paper first mathematically defines the strategies used to make Vision Transformers efficient, then describes and discusses state-of-the-art methodologies, and analyzes their performance across different application scenarios. We close by discussing open challenges and promising research directions.
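The quadratic growth of self-attention's cost follows directly from the token count: an H×W image split into P×P patches yields n = HW/P² tokens, and attention builds an n×n score matrix, so doubling the resolution quadruples n and multiplies the score matrix by 16. The minimal sketch below (illustrative only, not taken from the paper; the function name and head dimension are assumptions) makes this concrete:

```python
import numpy as np

def attention_cost(height, width, patch=16, dim=64):
    """Token count and attention-matrix cost for one self-attention head.

    Doubling the image resolution quadruples the token count n, so the
    n x n score matrix (and the QK^T FLOPs) grow by a factor of 16.
    """
    n = (height // patch) * (width // patch)  # number of patch tokens
    scores_entries = n * n                    # entries in the QK^T matrix
    flops_scores = 2 * n * n * dim            # multiply-adds to form QK^T
    return n, scores_entries, flops_scores

for res in (224, 448, 896):
    n, entries, flops = attention_cost(res, res)
    print(f"{res}x{res}: n={n:5d} tokens, scores={entries:,} entries, "
          f"~{flops / 1e6:.1f} MFLOPs for QK^T")
```

Running this shows 196 tokens at 224×224 but 3,136 at 896×896, with the score matrix growing 256-fold between the two, which is the memory wall the surveyed efficiency strategies target.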
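The Efficient Error Rate is meant to put heterogeneous cost factors (parameters, bits, FLOPs, model size) on a single comparable axis. The exact normalization is defined in the paper; the sketch below assumes a simple variant that averages each factor after dividing by the maximum value across the compared models. The function name, weighting, and the (approximate, publicly reported) model figures are illustrative assumptions, not the paper's definition:

```python
from typing import Dict, List

def efficient_error_rate(model: Dict[str, float],
                         cohort: List[Dict[str, float]],
                         factors=("params", "bits", "flops", "size_mb")) -> float:
    """Illustrative EER-style score: mean of each inference-time cost
    factor normalized by the cohort maximum. Lower means more efficient.
    NOTE: assumed formulation; see the paper for the exact definition."""
    score = 0.0
    for f in factors:
        max_f = max(m[f] for m in cohort)  # normalize by worst case in cohort
        score += model[f] / max_f
    return score / len(factors)

# Approximate figures at 224x224 input, FP32 weights.
models = [
    {"name": "ViT-B/16", "params": 86e6, "bits": 32, "flops": 17.6e9, "size_mb": 330},
    {"name": "DeiT-S",   "params": 22e6, "bits": 32, "flops": 4.6e9,  "size_mb": 84},
]
for m in models:
    print(m["name"], round(efficient_error_rate(m, models), 3))
```

Under this assumed normalization the cohort's largest model scores 1.0 by construction, and a quantized or pruned variant drops proportionally on every factor it shrinks, which is what makes the metric useful for cross-strategy comparisons.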

