Referring Transformer: A One-step Approach to Multi-task Visual Grounding

06/06/2021
by   Muchen Li, et al.

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require the design of complex task-specific one-stage architectures. In this paper, we propose a simple one-stage multi-task framework for visual grounding tasks. Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder. In the decoder, the model learns to generate contextualized lingual queries which are then decoded and used to directly regress the bounding box and produce a segmentation mask for the corresponding referred regions. With this simple but highly contextualized model, we outperform state-of-the-art methods by a large margin on both REC and RES tasks. We also show that a simple pre-training schedule (on an external dataset) further improves the performance. Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training.
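
The pipeline described in the abstract (joint visual-lingual encoding, a lingual query decoded against the fused features, then a box regression and a mask prediction from the same decoded query) can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the class name `GroundingSketch`, the patch-embedding convolution, the mean-pooled text query, the dot-product mask head, and all layer sizes are assumptions made for the example, and positional encodings and training losses are omitted.

```python
import torch
import torch.nn as nn


class GroundingSketch(nn.Module):
    """Illustrative one-stage, multi-task grounding model (hypothetical, not the paper's code)."""

    def __init__(self, vocab_size=1000, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        # Assumed stand-ins for a visual backbone and a text embedding layer.
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.word_embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(
            d_model, nhead, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.query_proj = nn.Linear(d_model, d_model)
        self.box_head = nn.Linear(d_model, 4)         # (cx, cy, w, h), normalized
        self.mask_proj = nn.Linear(d_model, d_model)  # assumed dot-product mask head

    def forward(self, image, tokens):
        feat = self.patchify(image)                    # (B, D, H', W')
        b, _, h, w = feat.shape
        visual = feat.flatten(2).transpose(1, 2)       # (B, H'*W', D)
        text = self.word_embed(tokens)                 # (B, T, D)
        n_vis = visual.shape[1]

        # Visual-lingual encoder: both modalities attend to each other.
        fused = self.encoder(torch.cat([visual, text], dim=1))
        vis_ctx, txt_ctx = fused[:, :n_vis], fused[:, n_vis:]

        # One contextualized lingual query per expression (mean pooling is an assumption).
        query = self.query_proj(txt_ctx.mean(dim=1, keepdim=True))  # (B, 1, D)
        decoded = self.decoder(query, fused).squeeze(1)             # (B, D)

        # Two task heads share the same decoded query.
        box = self.box_head(decoded).sigmoid()                      # (B, 4) in [0, 1]
        mask_logits = torch.einsum(
            "bd,bnd->bn", self.mask_proj(decoded), vis_ctx).view(b, h, w)
        return box, mask_logits


model = GroundingSketch().eval()
with torch.no_grad():
    box, mask_logits = model(torch.rand(2, 3, 64, 64),
                             torch.randint(0, 1000, (2, 5)))
print(box.shape, mask_logits.shape)
```

Sharing one decoded query between the box head and the mask head is what makes the sketch multi-task: gradients from both objectives would flow into the same encoder and decoder. The mask here is predicted at patch resolution; a real system would presumably upsample it to image resolution.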

