Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers

by Tobias Christian Nauen et al.

The growing popularity of Vision Transformers as the go-to models for image classification has led to an explosion of architectural modifications claiming to be more efficient than the original ViT. However, the wide diversity of experimental conditions prevents a fair comparison of these approaches based solely on their reported results. To address this gap in comparability, we conduct a comprehensive analysis of more than 30 models to evaluate the efficiency of vision transformers and related architectures, considering various performance metrics. Our benchmark provides a comparable baseline across the landscape of efficiency-oriented transformers, unveiling a plethora of surprising insights. For example, we discover that ViT is still Pareto optimal across multiple efficiency metrics, despite the existence of several alternative approaches claiming to be more efficient. Results also indicate that hybrid attention-CNN models fare particularly well when it comes to low inference memory and number of parameters, and that it is better to scale the model size than the image size. Furthermore, we uncover a strong positive correlation between the number of FLOPs and the training memory, which enables the estimation of required VRAM from theoretical measurements alone. Thanks to our holistic evaluation, this study offers valuable insights for practitioners and researchers, facilitating informed decisions when selecting models for specific applications. We publicly release our code and data at https://github.com/tobna/WhatTransformerToFavor.
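The reported FLOPs-to-VRAM correlation suggests a simple practical recipe: fit a linear model on a few measured (FLOPs, peak training memory) pairs, then predict the VRAM requirement of a new model from its theoretical FLOP count alone. The sketch below illustrates this with a least-squares fit; the numbers are hypothetical placeholders, not measurements from the paper.

```python
import numpy as np

# Hypothetical measurements: (GFLOPs per forward pass, peak training VRAM in GB).
# Illustrative placeholder values only, not data from the benchmark.
flops = np.array([1.3, 4.6, 17.6, 35.0, 55.5])
vram = np.array([2.1, 5.8, 19.4, 37.9, 58.7])

# Least-squares fit of a linear model: vram ≈ slope * flops + intercept.
slope, intercept = np.polyfit(flops, vram, deg=1)


def estimate_vram(gflops: float) -> float:
    """Estimate peak training VRAM (GB) from theoretical GFLOPs alone."""
    return slope * gflops + intercept
```

In practice one would measure a handful of models once, fit the line, and use `estimate_vram` to screen candidate architectures for a given GPU budget before running them.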
