3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation

by   Changli Wu, et al.

In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions. However, this conventional paradigm encounters significant challenges, most notably in terms of the generation of lackluster initial proposals and a pronounced deceleration in inference speed. Recognizing these limitations, we introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights. One of the keystones of our model is the Superpoint-Text Matching (STM) mechanism. Unlike traditional methods that navigate through instance proposals, STM directly correlates linguistic indications with their respective superpoints, clusters of semantically related points. This architectural decision empowers our model to efficiently harness cross-modal semantic relationships, primarily leveraging densely annotated superpoint-text pairs, as opposed to the more sparse instance-text pairs. In pursuit of enhancing the role of text in guiding the segmentation process, we further incorporate the Dependency-Driven Interaction (DDI) module to deepen the network's semantic comprehension of referring expressions. Using the dependency trees as a beacon, this module discerns the intricate relationships between primary terms and their associated descriptors in expressions, thereby elevating both the localization and segmentation capacities of our model. Comprehensive experiments on the ScanRefer benchmark reveal that our model not only set new performance standards, registering an mIoU gain of 11.7 points but also achieve a staggering enhancement in inference speed, surpassing traditional methods by 95.7 times. The code and models are available at https://github.com/sosppxo/3D-STMN.


page 3

page 7


EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation

Audio-guided Video Object Segmentation (A-VOS) and Referring Video Objec...

Relationship-Embedded Representation Learning for Grounding Referring Expressions

Grounding referring expressions in images aims to locate the object inst...

Scene Text Retrieval via Joint Text Detection and Similarity Learning

Scene text retrieval aims to localize and search all text instances from...

Side Adapter Network for Open-Vocabulary Semantic Segmentation

This paper presents a new framework for open-vocabulary semantic segment...

Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

This paper proposes an approach to Dense Video Captioning (DVC) without ...

USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval

As a fundamental and challenging task in bridging language and vision do...

A Real-time Global Inference Network for One-stage Referring Expression Comprehension

Referring Expression Comprehension (REC) is an emerging research spot in...

Please sign up or login with your details

Forgot password? Click here to reset