CLEVR-Ref+: Diagnosing Visual Reasoning with Referring Expressions

by Runtao Liu, et al.

Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there is evidence that current benchmark datasets suffer from bias, and the intermediate reasoning process of current state-of-the-art models cannot be easily evaluated. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through the sampling strategy), and the modular programs provide ground truth for intermediate reasoning steps without human annotators. In addition to evaluating several state-of-the-art models on CLEVR-Ref+, we also propose IEP-Ref, a module network approach that significantly outperforms other models on our dataset. In particular, we present two findings using IEP-Ref: (1) the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step by step; (2) even though every training example refers to at least one object, IEP-Ref can correctly predict no foreground when presented with false-premise referring expressions. To the best of our knowledge, this is the first direct and quantitative proof that neural modules behave in the way they are intended.
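As a rough illustration of what it means for a referring expression to be associated with a functional program, the sketch below runs a chain of attribute filters over a toy scene and records the objects selected after each step. It is a minimal sketch, not the dataset's actual program format or the IEP-Ref model: the scene, the `(attribute, value)` step encoding, and the `run_program` helper are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical toy scene: every object carries ground-truth attributes.
SceneObject = Dict[str, Union[int, str]]
Step = Tuple[str, str]  # (attribute, required value); an assumed encoding

SCENE: List[SceneObject] = [
    {"id": 0, "color": "red", "shape": "cube"},
    {"id": 1, "color": "red", "shape": "sphere"},
    {"id": 2, "color": "blue", "shape": "cube"},
]


def run_program(scene: List[SceneObject], program: List[Step]) -> List[List[int]]:
    """Apply each filter step in order and return the ids kept after every step."""
    current = list(scene)
    trace: List[List[int]] = []
    for attribute, value in program:
        # Keep only the objects whose attribute matches the requested value.
        current = [obj for obj in current if obj[attribute] == value]
        trace.append([int(obj["id"]) for obj in current])
    return trace


# "the red cube": each entry of the trace is an intermediate result.
red_cube = run_program(SCENE, [("color", "red"), ("shape", "cube")])

# "the green cube": a false-premise expression, so nothing is selected.
green_cube = run_program(SCENE, [("color", "green"), ("shape", "cube")])
```

Because each step's output is kept, the per-step selections play the role of intermediate reasoning ground truth, and an empty final selection corresponds to the no-foreground case for a false-premise expression.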


