Vision Transformers are Robust Learners

by Sayak Paul, et al.

Transformers, composed of multiple self-attention layers, hold strong promise as a generic learning primitive applicable to different data modalities, including computer vision, where recent breakthroughs have achieved state-of-the-art (SOTA) standard accuracy with better parameter efficiency. Since self-attention helps a model systematically align different components present inside the input data, it leaves grounds to investigate its performance under model robustness benchmarks. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and SOTA convolutional neural networks (CNNs), namely Big-Transfer (BiT). Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A, which is 4.3x higher than a comparable variant of BiT. Our analyses on image masking, Fourier spectrum sensitivity, and spread on discrete cosine energy spectrum reveal intriguing properties of ViT that contribute to improved robustness. Code for reproducing our experiments is available here:



Rethinking the Design Principles of Robust Vision Transformer

Recent advances on Vision Transformers (ViT) have shown that self-attent...

Are Vision Transformers Robust to Spurious Correlations?

Deep neural networks may be susceptible to learning spurious correlation...

An Impartial Take to the CNN vs Transformer Robustness Contest

Following the surge of popularity of Transformers in Computer Vision, se...

Are Transformers More Robust Than CNNs?

Transformer emerges as a powerful tool for visual recognition. In additi...

Towards Efficient Adversarial Training on Vision Transformers

Vision Transformer (ViT), as a powerful alternative to Convolutional Neu...

Out of Distribution Performance of State of Art Vision Model

The vision transformer (ViT) has advanced to the cutting edge in the vis...

Learning Diverse Features in Vision Transformers for Improved Generalization

Deep learning models often rely only on a small set of features even whe...
