Beyond One-to-One: Rethinking the Referring Image Segmentation

08/26/2023
by   Yutao Hu, et al.
0

Referring image segmentation aims to segment the target object referred by a natural language expression. However, previous methods rely on the strong assumption that one sentence must describe one target in the image, which is often not the case in real-world applications. As a result, such methods fail when the expressions refer to either no objects or multiple objects. In this paper, we address this issue from two perspectives. First, we propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches and enables information flow in two directions. In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target. Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature. In this way, visual features are encouraged to contain the critical semantic information about target entity, which supports the accurate segmentation in the text-to-image decoder in turn. Secondly, we collect a new challenging but realistic dataset called Ref-ZOM, which includes image-text pairs under different settings. Extensive experiments demonstrate our method achieves state-of-the-art performance on different datasets, and the Ref-ZOM-trained model performs well on various types of text inputs. Codes and datasets are available at https://github.com/toggle1995/RIS-DMMI.

READ FULL TEXT

page 1

page 3

page 6

page 8

research
12/27/2022

Position-Aware Contrastive Alignment for Referring Image Segmentation

Referring image segmentation aims to segment the target object described...
research
01/16/2023

Learning Aligned Cross-modal Representations for Referring Image Segmentation

Referring image segmentation aims to segment the image region of interes...
research
01/30/2020

Dual Convolutional LSTM Network for Referring Image Segmentation

We consider referring image segmentation. It is a problem at the interse...
research
09/20/2022

Towards Robust Referring Image Segmentation

Referring Image Segmentation (RIS) aims to connect image and language vi...
research
08/08/2023

EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation

Audio-guided Video Object Segmentation (A-VOS) and Referring Video Objec...
research
03/26/2022

GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

The task of Human-Object Interaction (HOI) detection could be divided in...
research
05/21/2023

Advancing Referring Expression Segmentation Beyond Single Image

Referring Expression Segmentation (RES) is a widely explored multi-modal...

Please sign up or login with your details

Forgot password? Click here to reset