Focal Modulation Networks

03/22/2022
by Jianwei Yang, et al.

In this work, we propose focal modulation networks (FocalNets in short), in which self-attention (SA) is completely replaced by a focal modulation module that is more effective and efficient for modeling token interactions. Focal modulation comprises three components: (i) hierarchical contextualization, implemented with a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges at different levels of granularity; (ii) gated aggregation, to selectively aggregate context features for each visual token (query) based on its content; and (iii) modulation, an element-wise affine transformation that fuses the aggregated features into the query vector. Extensive experiments show that FocalNets outperform state-of-the-art SA counterparts (e.g., Swin Transformers) with similar time and memory costs on the tasks of image classification, object detection, and semantic segmentation. Specifically, our FocalNets at tiny and base sizes achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at 224×224 resolution, they attain 86.5% and 87.3% top-1 accuracy when finetuned at 224×224 and 384×384 resolution, respectively. FocalNets exhibit remarkable superiority when transferred to downstream tasks. For object detection with Mask R-CNN, our FocalNet base trained with a 1× schedule already surpasses Swin trained with a 3× schedule (49.0 vs. 48.5). For semantic segmentation with UperNet, FocalNet base evaluated at single scale outperforms Swin evaluated at multi-scale (50.5 vs. 49.7). These results render focal modulation a favorable alternative to SA for effective and efficient visual modeling in real-world applications. Code is available at https://github.com/microsoft/FocalNet.
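To make the three components concrete, below is a minimal PyTorch sketch of a focal modulation layer. The class name `FocalModulation`, the parameters `focal_levels` and `base_kernel`, the kernel-size progression, and the projection layout are illustrative assumptions for this sketch, not the official microsoft/FocalNet implementation (see the repository linked above for that).

```python
# A minimal sketch of focal modulation as described in the abstract.
# Layer sizes and names are illustrative, not the official FocalNet code.
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    def __init__(self, dim: int, focal_levels: int = 3, base_kernel: int = 3):
        super().__init__()
        self.focal_levels = focal_levels
        # Project each token into a query, context features, and (L+1) gates.
        self.proj_in = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # (i) Hierarchical contextualization: a stack of depth-wise convs,
        # each level enlarging the effective receptive field (kernels 3, 5, 7, ...).
        self.context_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=base_kernel + 2 * l,
                          padding=(base_kernel + 2 * l) // 2, groups=dim),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        self.proj_context = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) visual tokens on a 2-D grid.
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(
            self.proj_in(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)      # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)  # (B, L+1, H, W)

        # (ii) Gated aggregation: weight each context level per token (query).
        modulator = 0
        for l, layer in enumerate(self.context_layers):
            ctx = layer(ctx)  # short -> long range contexts
            modulator = modulator + ctx * gates[:, l:l + 1]
        # Global average pooling serves as the final, (L+1)-th context level.
        global_ctx = ctx.mean(dim=(2, 3), keepdim=True)
        modulator = modulator + global_ctx * gates[:, self.focal_levels:]
        modulator = self.proj_context(modulator).permute(0, 2, 3, 1)

        # (iii) Modulation: element-wise interaction with the query.
        return self.proj_out(q * modulator)
```

As a quick shape check, `FocalModulation(96)(torch.randn(2, 56, 56, 96))` returns a tensor of shape `(2, 56, 56, 96)`: the layer maps a token grid to a token grid of the same size, which is what lets it stand in for a self-attention block.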


