Pyramid Fusion Transformer for Semantic Segmentation

01/11/2022
by Zipeng Qin, et al.

The recently proposed MaskFormer <cit.> offers a fresh perspective on semantic segmentation: it shifts from the popular pixel-level classification paradigm to mask-level classification. In essence, it generates paired class probabilities and masks corresponding to category segments and combines them at inference time to produce the segmentation map. Segmentation quality therefore depends on how well the queries capture the semantic information of the categories and their spatial locations within the image. In our study, we find that a per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities or masks. To mine rich semantic information across the feature pyramid, we propose the Pyramid Fusion Transformer (PFT), a transformer-based model for per-mask semantic segmentation on top of multi-scale features. To use image features of different resolutions efficiently without incurring too much computational overhead, PFT uses a multi-scale transformer decoder with cross-scale inter-query attention to exchange complementary information. Extensive experimental evaluations and ablations demonstrate the efficacy of our framework. In particular, we achieve a 3.2 mIoU improvement over MaskFormer on the COCO-Stuff 10K dataset with a ResNet-101c backbone. In addition, on the ADE20K validation set, our result with a Swin-B backbone matches that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.3 mIoU respectively. Using a Swin-L backbone, we achieve 56.0 mIoU single-scale and 57.2 mIoU multi-scale on the ADE20K validation set, obtaining state-of-the-art performance on the dataset.
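To make the "paired probabilities and masks" idea concrete, the following is a minimal sketch of how a mask-classification model might combine its per-query outputs into a segmentation map at inference time. It is an illustration under stated assumptions, not the authors' implementation: the function name `combine_masks`, the trailing "no object" class, and the softmax/sigmoid normalization are assumptions in the style of MaskFormer-like models and are not specified in the abstract.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def combine_masks(class_logits, mask_logits):
    """Combine per-query class probabilities and masks into a label map.

    class_logits: [Q][C + 1] logits per query; the last entry is assumed
                  to be a "no object" class (an assumption of this sketch).
    mask_logits:  [Q][H][W] mask logits per query.
    Returns an [H][W] list of predicted class indices.
    """
    num_queries = len(class_logits)
    num_classes = len(class_logits[0]) - 1
    # Class probabilities per query, dropping the assumed "no object" slot.
    probs = [softmax(row)[:num_classes] for row in class_logits]
    height = len(mask_logits[0])
    width = len(mask_logits[0][0])

    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of
            # (class probability) * (mask probability).
            scores = [
                sum(probs[q][c] * sigmoid(mask_logits[q][y][x])
                    for q in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Toy input: 2 queries, 2 classes (+ "no object"), a 1x2 image.
toy_class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
toy_mask_logits = [[[4.0, -4.0]], [[-4.0, 4.0]]]
print(combine_masks(toy_class_logits, toy_mask_logits))
```

In the toy input, the first query is confident about class 0 and activates on the left pixel, while the second is confident about class 1 and activates on the right pixel, so each pixel is assigned the class of the query that covers it. This also shows why the abstract stresses reliable probabilities and masks: an error in either factor degrades the per-pixel score.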


