From Two Graphs to N Questions: A VQA Dataset for Compositional Reasoning on Vision and Commonsense

by Difei Gao, et al.

Visual Question Answering (VQA) is a challenging task for evaluating how comprehensively a system understands the world. Existing benchmarks usually focus either on reasoning over vision alone or mainly on knowledge, with relatively simple visual abilities. However, the ability to answer a question that requires alternately inferring on the image content and on commonsense knowledge is crucial for an advanced VQA system. In this paper, we introduce a VQA dataset that provides more challenging and general questions about Compositional Reasoning on vIsion and Commonsense, named CRIC. To create this dataset, we develop a powerful method to automatically generate compositional questions and rich annotations from both the scene graph of a given image and an external knowledge graph. Moreover, this paper presents a new compositional model capable of implementing various types of reasoning functions on the image content and the knowledge graph. Further, we analyze several baselines, a state-of-the-art model, and our model on the CRIC dataset. The experimental results show that the proposed task is challenging: the state-of-the-art model obtains 52.26% accuracy and our model obtains 58.38%.




VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge

There has been a growing interest in solving Visual Question Answering (...

CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes

3D scene understanding is a relatively emerging research field. In this ...

From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering

In order to achieve a general visual question answering (VQA) system, it...

Generating Rationales in Visual Question Answering

Despite recent advances in Visual Question Answering (VQA), it remains a ...

Understanding Knowledge Gaps in Visual Question Answering: Implications for Gap Identification and Testing

Visual Question Answering (VQA) systems are tasked with answering natura...

Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network

Explanation and high-order reasoning capabilities are crucial for real-w...

Can you even tell left from right? Presenting a new challenge for VQA

Visual Question Answering (VQA) needs a means of evaluating the strength...
