You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding

02/12/2019
by   Chaorui Deng, et al.

Visual Grounding (VG) aims to locate the most relevant region in an image based on a flexible natural language query rather than a pre-defined label, which makes it a more practical technique than object detection. Most state-of-the-art VG methods operate in a two-stage manner: in the first stage, an object detector generates a set of object proposals from the input image; the second stage is then formulated as a cross-modal matching problem that finds the best match between the language query and the region proposals. This is rather inefficient, as the first stage may produce hundreds of proposals that all need to be compared in the second stage, and the strategy is also prone to inaccuracy. In this paper, we propose a simple, intuitive, and much more elegant one-stage detection-based method that merges the region proposal and matching stages into a single detection network. The detection is conditioned on the input query through a stack of novel Relation-to-Attention modules, which transform the image-to-query relationship into a relation map that is used to predict the bounding box directly, without producing large numbers of useless region proposals. During inference, our approach is about 20x to 30x faster than previous methods and, remarkably, achieves state-of-the-art results on several benchmark datasets. We release our code and all pre-trained models at https://github.com/openblack/rvg.
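The abstract describes replacing proposal generation plus cross-modal matching with a single query-conditioned detector. As a rough illustration of that idea, the PyTorch sketch below shows one plausible shape such a network could take: a stack of attention-style fusion blocks turns image-to-query similarity into a spatial relation map, and a dense head regresses the box directly from the re-weighted features. All module names, dimensions, and the dot-product fusion here are our assumptions for illustration, not the authors' released implementation (see the linked repository for that).

import torch
import torch.nn as nn

class RelationToAttention(nn.Module):
    """Turn image-to-query similarity into a spatial relation map (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.img_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.txt_proj = nn.Linear(dim, dim)

    def forward(self, img_feat, query_feat):
        # img_feat: (B, C, H, W) visual features; query_feat: (B, C) pooled query embedding
        v = self.img_proj(img_feat)                        # (B, C, H, W)
        q = self.txt_proj(query_feat)[:, :, None, None]    # (B, C, 1, 1)
        relation = (v * q).sum(dim=1, keepdim=True)        # (B, 1, H, W) dot-product relation
        attn = torch.sigmoid(relation)                     # relation map in [0, 1]
        return img_feat * attn                             # re-weight visual features by the map

class OneStageGrounder(nn.Module):
    """Dense, proposal-free grounding head: predict one box per query in a single pass."""
    def __init__(self, dim=256, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(RelationToAttention(dim) for _ in range(num_blocks))
        self.box_head = nn.Conv2d(dim, 4, kernel_size=1)    # per-cell (x, y, w, h)
        self.score_head = nn.Conv2d(dim, 1, kernel_size=1)  # per-cell grounding confidence

    def forward(self, img_feat, query_feat):
        x = img_feat
        for block in self.blocks:        # a stack of relation-to-attention modules
            x = block(x, query_feat)
        boxes = self.box_head(x)         # (B, 4, H, W) dense box predictions
        scores = self.score_head(x)      # (B, 1, H, W)
        # Select the single highest-scoring cell: no proposals, no separate ranking stage.
        B = scores.shape[0]
        idx = scores.flatten(2).argmax(dim=2).squeeze(1)       # (B,) best spatial location
        best_box = boxes.flatten(2)[torch.arange(B), :, idx]   # (B, 4)
        return best_box, scores

Because the query conditions the detector itself, a single forward pass replaces the hundreds of per-proposal comparisons of a two-stage pipeline, which is what makes the reported 20x to 30x speedup plausible.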


Related research

03/04/2022
F2DNet: Fast Focal Detection Network for Pedestrian Detection
Two-stage detectors are state-of-the-art in object detection as well as ...

07/12/2022
Dynamic Proposals for Efficient Object Detection
Object detection is a basic computer vision task to localize and catego...

05/05/2021
Proposal-free One-stage Referring Expression via Grid-Word Cross-Attention
Referring Expression Comprehension (REC) has become one of the most impo...

04/13/2022
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
3D visual grounding aims to locate the referred target object in 3D poin...

01/16/2022
YOLO – You only look 10647 times
With this work we are explaining the "You Only Look Once" (YOLO) single-...

05/12/2021
VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching
The prevailing framework for matching multimodal inputs is based on a tw...

08/18/2021
Social Fabric: Tubelet Compositions for Video Relation Detection
This paper strives to classify and detect the relationship between objec...
