Robustifying Token Attention for Vision Transformers

03/20/2023
by Yong Guo, et al.

Despite the success of vision transformers (ViTs), they still suffer from significant drops in accuracy in the presence of common corruptions, such as noise or blur. Interestingly, we observe that the attention mechanism of ViTs tends to rely on a few important tokens, a phenomenon we call token overfocusing. More critically, these tokens are not robust to corruptions, often leading to highly diverging attention patterns. In this paper, we intend to alleviate this overfocusing issue and make attention more stable through two general techniques: First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism. Specifically, TAP learns average pooling schemes for each token such that the information of potentially important tokens in the neighborhood can adaptively be taken into account. Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few by using our Attention Diversification Loss (ADL). We achieve this by penalizing high cosine similarity between the attention vectors of different tokens. In experiments, we apply our methods to a wide range of transformer architectures and improve robustness significantly. For example, we improve corruption robustness on ImageNet-C by 2.4% while also improving accuracy by 0.4%. When finetuning on semantic segmentation tasks, we improve robustness on CityScapes-C by 2.4%.
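
To illustrate the ADL idea described above, here is a minimal PyTorch sketch that penalizes the mean pairwise cosine similarity between the attention vectors of different output tokens. This is not the authors' reference implementation; the function name, the (batch, heads, N, N) attention-tensor layout, the off-diagonal averaging, and the weighting factor lambda_adl in the usage note are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def attention_diversification_loss(attn: torch.Tensor) -> torch.Tensor:
    """Sketch of an attention diversification penalty.

    attn: attention weights of shape (batch, heads, N, N), where
    attn[b, h, i, :] is the attention vector of output token i.
    """
    B, H, N, _ = attn.shape
    # L2-normalize each token's attention vector so that dot products
    # between rows become cosine similarities.
    a = F.normalize(attn, dim=-1)
    # Pairwise cosine similarity between all output tokens: (B, H, N, N).
    sim = a @ a.transpose(-2, -1)
    # Zero out the diagonal (each vector is trivially similar to itself).
    sim = sim - torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    # Penalize the average similarity between distinct tokens.
    return sim.sum() / (B * H * N * (N - 1))


# Usage sketch: add the penalty (summed over the attention maps of the
# transformer blocks) to the task loss with a hypothetical weight lambda_adl:
# total_loss = task_loss + lambda_adl * sum(
#     attention_diversification_loss(a) for a in attention_maps)
```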

research
08/03/2023

Dynamic Token-Pass Transformers for Semantic Segmentation

Vision transformers (ViT) usually extract features via forwarding all th...
research
10/08/2021

Token Pooling in Vision Transformers

Despite the recent success in many applications, the high computational ...
research
08/07/2021

PSViT: Better Vision Transformer via Token Pooling and Attention Sharing

In this paper, we observe two levels of redundancies when applying visio...
research
07/14/2022

Forming Trees with Treeformers

Popular models such as Transformers and LSTMs use tokens as its unit of ...
research
06/23/2023

Max-Margin Token Selection in Attention Mechanism

Attention mechanism is a central component of the transformer architectu...
research
06/23/2021

Probabilistic Attention for Interactive Segmentation

We provide a probabilistic interpretation of attention and show that the...
research
08/13/2020

On the Importance of Local Information in Transformer Based Models

The self-attention module is a key component of Transformer-based models...
