Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem

by Zheng Wang, et al.

Recent research on the robustness of deep learning has shown that Vision Transformers (ViTs) surpass Convolutional Neural Networks (CNNs) under some perturbations, such as natural corruptions and adversarial attacks. Some papers argue that the superior robustness of ViTs comes from the segmentation of their input images into patches; others say that Multi-head Self-Attention (MSA) is the key to preserving robustness. In this paper, we introduce a principled and unified theoretical framework to examine these arguments about ViTs' robustness. We first prove that, unlike Transformers in Natural Language Processing, ViTs are Lipschitz continuous. We then analyze the adversarial robustness of ViTs from the perspective of the Cauchy Problem, which lets us quantify how robustness propagates through layers. We demonstrate that the first and last layers are the critical factors affecting the robustness of ViTs. Furthermore, based on our theory, we empirically show that, contrary to the claims of existing research, MSA only contributes to the adversarial robustness of ViTs under weak adversarial attacks, e.g., FGSM, and, surprisingly, MSA actually compromises the model's adversarial robustness under stronger attacks, e.g., PGD.
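
To make the two attacks named in the abstract concrete, the following is a minimal sketch of FGSM (a single signed-gradient step) and PGD (repeated signed-gradient steps, each projected back into an L-infinity ball of radius eps). It is not the paper's experimental setup: the logistic model, the random weights and input, and the names `loss_and_grad`, `fgsm`, `pgd`, `eps`, `step` and `n_steps` are all illustrative assumptions.

```python
import numpy as np

def loss_and_grad(w, b, x, y):
    """Binary cross-entropy of a toy logistic model and its gradient w.r.t. the input x."""
    z = float(w @ x + b)
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -(y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12))
    grad_x = (p - y) * w  # derivative of the loss with respect to x
    return loss, grad_x

def fgsm(w, b, x, y, eps):
    """FGSM: one step of size eps along the sign of the input gradient."""
    _, g = loss_and_grad(w, b, x, y)
    return x + eps * np.sign(g)

def pgd(w, b, x, y, eps, step, n_steps):
    """PGD: several small signed-gradient steps, each clipped to the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        _, g = loss_and_grad(w, b, x_adv, y)
        x_adv = x_adv + step * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the L-infinity ball
    return x_adv

# Hypothetical toy data (assumption: 8 input features, true label 1).
rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.1
x, y = rng.normal(size=8), 1.0
eps = 0.1

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps, step=0.025, n_steps=10)

print("clean loss:", loss_and_grad(w, b, x, y)[0])
print("FGSM loss: ", loss_and_grad(w, b, x_fgsm, y)[0])
print("PGD loss:  ", loss_and_grad(w, b, x_pgd, y)[0])
```

For this toy model the sign of the input gradient never changes, so PGD ends at the same point as FGSM. The two attacks differ only for nonlinear models such as ViTs, where the repeated steps let PGD follow the changing gradient; that is why PGD is treated as the stronger attack.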



Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks

Vision Transformers have emerged as a powerful architecture that can out...

Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?

Vision transformers (ViTs) have recently set off a new wave in neural ar...

Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO

This work conducts the first analysis on the robustness against adversar...

Structural Robustness for Deep Learning Architectures

Deep Networks have been shown to provide state-of-the-art performance in...

Are Transformers More Robust Than CNNs?

Transformer emerges as a powerful tool for visual recognition. In additi...

Attacking Compressed Vision Transformers

Vision Transformers are increasingly embedded in industrial systems due ...
