Beyond Fixation: Dynamic Window Visual Transformer

by Pengzhen Ren, et al.

Recently, there has been a surge of interest in vision transformers that reduce computational cost by limiting self-attention to a local window. Most current work defaults to a single fixed-scale window, ignoring the impact of window size on model performance; this may limit the ability of these window-based models to capture multi-scale information. In this paper, we propose a novel method named Dynamic Window Vision Transformer (DW-ViT). The dynamic window strategy of DW-ViT goes beyond models that employ a fixed single-window setting. To the best of our knowledge, we are the first to use dynamic multi-scale windows to explore the upper limit of the effect of window settings on model performance. In DW-ViT, multi-scale information is obtained by assigning windows of different sizes to different head groups of window-based multi-head self-attention. The information is then fused dynamically by assigning different weights to the multi-scale window branches. We conduct a detailed performance evaluation on three datasets: ImageNet-1K, ADE20K, and COCO. Compared with related state-of-the-art (SoTA) methods, DW-ViT obtains the best performance. Specifically, compared with the current SoTA Swin Transformer <cit.>, DW-ViT achieves consistent and substantial improvements on all three datasets with similar parameter counts and computational costs. In addition, DW-ViT exhibits good scalability and can easily be inserted into any window-based vision transformer.
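The two ingredients the abstract describes — self-attention restricted to local windows, with different window sizes per branch, fused by data-dependent weights — can be illustrated with a minimal NumPy sketch. This is not the paper's actual architecture or API: the function names, the 1-D token layout, and the simple pooled-score fusion below are all illustrative stand-ins for the learned fusion module described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, window):
    """Self-attention restricted to non-overlapping windows along the
    sequence axis. x: (seq_len, dim); seq_len must be divisible by window."""
    seq_len, dim = x.shape
    out = np.empty_like(x)
    for start in range(0, seq_len, window):
        w = x[start:start + window]                  # tokens in one window
        scores = softmax(w @ w.T / np.sqrt(dim))     # (window, window) attention
        out[start:start + window] = scores @ w
    return out

def dynamic_multiscale_window_attention(x, windows=(2, 4, 8)):
    """Run one window-attention branch per window size, then fuse the
    branches with data-dependent weights (here: global average pool of each
    branch -> softmax), mimicking the dynamic fusion idea in spirit only."""
    branches = np.stack([window_attention(x, w) for w in windows])  # (B, L, D)
    scores = branches.mean(axis=(1, 2))              # one scalar per branch, (B,)
    weights = softmax(scores)                        # branch weights, sum to 1
    return np.tensordot(weights, branches, axes=1)   # weighted sum -> (L, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))                # 8 tokens of dimension 16
fused = dynamic_multiscale_window_attention(tokens)
print(fused.shape)                                   # (8, 16)
```

In the real model the branch weights come from a learned module and the windows are 2-D patches of a feature map, but the sketch shows the shape of the idea: each head group sees a different receptive field, and the mix over groups is input-dependent rather than fixed.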


