Modeling Relationships in Referential Expressions with Compositional Modular Networks

11/30/2016
by Ronghang Hu, et al.

People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.
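
To make the two module types concrete, the following is a minimal sketch of how a unary module (scoring a single region against an entity phrase) and a pairwise module (scoring an ordered pair of regions against a relationship phrase) could be combined to ground an expression such as "the black cat sitting under the table". Everything in it is an assumption made for illustration: the function names, the hand-written feature vectors, the dot-product scoring, the element-wise difference used as a pair feature, and the additive combination of scores are not taken from the abstract. In particular, the phrase vectors are supplied by hand here, whereas the abstract describes the linguistic analysis as learned end-to-end.

```python
from itertools import product


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def localization_score(region_feat, phrase_vec):
    """Unary module (hypothetical): match one region to an entity phrase."""
    return dot(region_feat, phrase_vec)


def relationship_score(feat_a, feat_b, phrase_vec):
    """Pairwise module (hypothetical): match an ordered region pair to a relation phrase.

    The pair feature is assumed to be the element-wise difference of the two regions.
    """
    pair_feat = [a - b for a, b in zip(feat_a, feat_b)]
    return dot(pair_feat, phrase_vec)


def ground(regions, q_subj, q_rel, q_obj):
    """Return the best (subject_index, object_index) pair and its score.

    The pair score is assumed to be the sum of the two unary scores and the pairwise score.
    """
    best_pair, best_score = None, float("-inf")
    for i, j in product(range(len(regions)), repeat=2):
        if i == j:
            continue  # subject and object must be different regions
        score = (
            localization_score(regions[i], q_subj)
            + relationship_score(regions[i], regions[j], q_rel)
            + localization_score(regions[j], q_obj)
        )
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


# Toy features (made up): [cat-ness, table-ness, height in the image]
regions = [
    [1, 0, 1],  # region 0: a cat at table height
    [1, 0, 0],  # region 1: a cat on the floor
    [0, 1, 1],  # region 2: a table
]
q_subj = [1, 0, 0]   # "cat"
q_rel = [0, 0, -1]   # "under": subject lower than object
q_obj = [0, 1, 0]    # "table"

best_pair, best_score = ground(regions, q_subj, q_rel, q_obj)
```

With these toy inputs, the selected pair is region 1 as the subject and region 2 as the object. Scoring every ordered pair costs time quadratic in the number of regions, which is acceptable for a sketch but would likely need pruning or batching with many region proposals.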
