ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer

by Youwei Pang, et al.

Deep learning (DL) has advanced the field of dense prediction while gradually dissolving the inherent barriers between different tasks. However, most existing works focus on designing architectures and constructing visual cues for a single specific task, ignoring the potential uniformity introduced by the DL paradigm. In this paper, we attempt to construct a novel ComPlementary transformer, ComPtr, for diverse bi-source dense prediction tasks. Specifically, unlike existing methods that over-specialize in a single task or a subset of tasks, ComPtr starts from the more general concept of bi-source dense prediction. Because these tasks all depend on information complementarity, we propose consistency enhancement and difference awareness components, with which ComPtr can respectively excavate and collect important visual semantic cues from different image sources for diverse tasks. ComPtr treats different inputs equally and builds an efficient dense interaction model in a sequence-to-sequence form on top of the transformer. This task-generic design provides a smooth foundation for constructing a unified model that can simultaneously handle various kinds of bi-source information. In extensive experiments across several representative vision tasks, i.e., remote sensing change detection, RGB-T crowd counting, RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed method consistently obtains favorable performance. The code will be available at <>.



