Grounding of Textual Phrases in Images by Reconstruction

11/12/2015
by Anna Rohrbach, et al.

Grounding (i.e., localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications in human-computer interaction and image-text reference resolution. Few datasets provide ground-truth spatial localization of phrases, so it is desirable to learn from data with little or no grounding supervision. We propose a novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the attention, i.e., the grounding, is evaluated for correctness. If grounding supervision is available, it can be applied directly via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr 30k Entities and ReferItGame datasets with different levels of supervision, ranging from no supervision through partial supervision to full supervision. Our supervised variant improves by a large margin over the state of the art on both datasets.
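
The attention step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the phrase has already been encoded into a vector and that each candidate image region has a feature vector. The scoring network (one ReLU layer followed by a linear layer) and all names (`attend`, `W_p`, `W_r`, `w_att`, `attention_loss`, `ground`) are hypothetical choices made for illustration. The recurrent phrase encoder and the reconstruction decoder are omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend(phrase_vec, region_feats, W_p, W_r, w_att):
    """Return one attention weight per candidate region.

    phrase_vec:   (P,)   encoded phrase (encoder not shown here)
    region_feats: (N, D) one feature vector per candidate region
    W_p: (P, H), W_r: (D, H), w_att: (H,)  hypothetical learned weights
    """
    # Combine each region with the phrase, then score it (assumed form).
    hidden = np.maximum(0.0, region_feats @ W_r + phrase_vec @ W_p)  # (N, H)
    scores = hidden @ w_att                                          # (N,)
    return softmax(scores)


def attended_feature(attention, region_feats):
    """Attention-weighted sum of region features, shape (D,).

    In the unsupervised setting this vector would be passed to a decoder
    that tries to reconstruct the phrase (decoder not shown here).
    """
    return attention @ region_feats


def attention_loss(attention, target_idx):
    """Cross-entropy on the attention, usable when a ground-truth region exists."""
    return -np.log(attention[target_idx] + 1e-12)


def ground(attention):
    """At test time, predict the region with the highest attention."""
    return int(np.argmax(attention))
```

With no grounding labels, only a reconstruction loss on the decoded phrase would drive the attention. With labels, `attention_loss` can be added for the annotated phrases, which corresponds to the partially and fully supervised settings mentioned in the abstract.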

research
12/07/2018

PIRC Net : Using Proposal Indexing, Relationships and Context for Phrase Grounding

Phrase Grounding aims to detect and localize objects in images that are ...
research
06/06/2020

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Grounding free-form textual queries necessitates an understanding of the...
research
03/27/2019

Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

We address the problem of grounding free-form textual phrases by using w...
research
05/03/2017

Weakly-supervised Visual Grounding of Phrases with Linguistic Structures

We propose a weakly-supervised approach that takes image-sentence pairs ...
research
06/01/2019

Learning to Generate Grounded Image Captions without Localization Supervision

When generating a sentence description for an image, it frequently remai...
research
11/21/2016

Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

This paper presents a framework for localization or grounding of phrases...
research
08/11/2021

A Better Loss for Visual-Textual Grounding

Given a textual phrase and an image, the visual grounding problem is def...
