VQA-GNN: Reasoning with Multimodal Semantic Graph for Visual Question Answering

by Yanan Wang, et al.

Visual understanding requires seamless integration between recognition and reasoning: beyond image-level recognition (e.g., detecting objects), systems must perform concept-level reasoning (e.g., inferring the context of objects and the intents of people). However, existing methods only model image-level features and do not ground them in, or reason with, background concepts such as knowledge graphs (KGs). In this work, we propose a novel visual question answering method, VQA-GNN, which unifies image-level information and conceptual knowledge to perform joint reasoning about the scene. Specifically, given a question-image pair, we build a scene graph from the image, retrieve a relevant linguistic subgraph from ConceptNet and a visual subgraph from VisualGenome, and unify these three graphs and the question into one joint graph, the multimodal semantic graph. Our VQA-GNN then learns to aggregate messages and reason across the different modalities captured by the multimodal semantic graph. In the evaluation on the VCR task, our method outperforms the previous scene graph-based Trans-VL models by over 4%, and a model that fuses a Trans-VL further improves the state of the art by 2%, attaining the top of the VCR leaderboard at the time of submission. This result suggests the efficacy of our model in performing conceptual reasoning beyond image-level recognition for visual understanding. Finally, we demonstrate that our model is the first work to provide interpretability across the visual and textual knowledge domains for the VQA task.
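The core idea, unifying image-level nodes and concept nodes into one graph and letting a GNN pass messages across modalities, can be illustrated with a minimal, framework-free sketch. The node names, feature sizes, and simple mean-aggregation update below are illustrative assumptions for exposition, not the paper's actual architecture.

```python
from collections import defaultdict

def message_pass(features, edges, steps=2):
    """Mean-aggregation message passing over an undirected graph.

    features: dict node -> list[float] (all nodes share one dimensionality)
    edges: list of (src, dst) pairs; messages flow both ways
    """
    # Build an undirected adjacency map.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    h = {n: list(f) for n, f in features.items()}
    dim = len(next(iter(h.values())))
    for _ in range(steps):
        new_h = {}
        for n, vec in h.items():
            neigh = adj[n]
            if not neigh:
                new_h[n] = vec  # isolated node keeps its features
                continue
            # Sum neighbor features, then average.
            agg = [0.0] * dim
            for m in neigh:
                for i, x in enumerate(h[m]):
                    agg[i] += x
            # Mix self and averaged neighbor features 50/50 (a toy update rule).
            new_h[n] = [0.5 * vec[i] + 0.5 * agg[i] / len(neigh)
                        for i in range(dim)]
        h = new_h
    return h

# Toy unified graph: an image-level node ("img:person") and a concept node
# ("kg:intent") joined through a question node, echoing the idea of a
# multimodal semantic graph. Features here are arbitrary 2-d vectors.
feats = {
    "question":   [1.0, 0.0],
    "img:person": [0.0, 1.0],
    "kg:intent":  [0.5, 0.5],
}
edges = [("question", "img:person"), ("question", "kg:intent")]
out = message_pass(feats, edges, steps=1)
# After one step the question node has absorbed information from both
# modalities: out["question"] == [0.625, 0.375]
```

In the actual model, the update would be a learned GNN layer (e.g., attention-weighted aggregation with per-modality parameters) rather than a fixed mean, but the cross-modal information flow follows the same pattern.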




Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering

Knowledge-based Visual Question Answering (KVQA) requires external knowl...

Visual Query Answering by Entity-Attribute Graph Matching and Reasoning

Visual Query Answering (VQA) is of great significance in offering people...

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Answering questions that require reading texts in an image is challengin...

Integrating Knowledge and Reasoning in Image Understanding

Deep learning based data-driven approaches have been successfully applie...

Lightweight Visual Question Answering using Scene Graphs

Visual question answering (VQA) is a challenging problem in machine perc...

Barlow constrained optimization for Visual Question Answering

Visual question answering is a vision-and-language multimodal task, that...

Learning by Abstraction: The Neural State Machine

We introduce the Neural State Machine, seeking to bridge the gap between...
