Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering

07/13/2021
by   Rajat Koner, et al.

Visual Question Answering (VQA) is concerned with answering free-form questions about an image. Because it requires a deep semantic and linguistic understanding of the question, together with the ability to associate it with the various objects present in the image, VQA is an ambitious task that demands multi-modal reasoning spanning both computer vision and natural language processing. We propose Graphhopper, a novel method that approaches the task by integrating knowledge graph reasoning, computer vision, and natural language processing techniques. Concretely, our method performs context-driven, sequential reasoning over the scene entities and their semantic and spatial relationships. As a first step, we derive a scene graph that describes the objects in the image, their attributes, and their mutual relationships. Subsequently, a reinforcement learning agent is trained to autonomously navigate in a multi-hop manner over the extracted scene graph and generate reasoning paths, which form the basis for deriving answers. We conduct an experimental study on the challenging GQA dataset, using both manually curated and automatically generated scene graphs. Our results show that we keep up with human performance on manually curated scene graphs. Moreover, we find that Graphhopper outperforms another state-of-the-art scene graph reasoning model on both manually curated and automatically generated scene graphs by a significant margin.
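The multi-hop reasoning idea above can be illustrated with a toy, deterministic sketch. The scene graph, entity names, and fixed relation sequence below are illustrative assumptions; the paper's actual agent is a trained reinforcement learning policy that selects edges conditioned on the question.

```python
# Toy illustration of multi-hop navigation over a scene graph.
# NOTE: the graph, entities, and fixed relation sequence are hypothetical
# examples, not Graphhopper's trained RL agent or the GQA schema.

# Scene graph: entity -> list of (relation, neighbor) edges.
scene_graph = {
    "woman": [("holding", "umbrella"), ("wearing", "coat")],
    "umbrella": [("has_attribute", "red")],
    "coat": [("has_attribute", "blue")],
}

def multi_hop(graph, start, relations):
    """Follow a fixed sequence of relations from a start entity and
    return the reasoning path of visited nodes."""
    path = [start]
    node = start
    for rel in relations:
        neighbors = dict(graph.get(node, []))
        if rel not in neighbors:
            return path  # dead end: current node has no such outgoing edge
        node = neighbors[rel]
        path.append(node)
    return path

# Question: "What color is the umbrella the woman is holding?"
# A two-hop path answers it: woman -holding-> umbrella -has_attribute-> red
print(multi_hop(scene_graph, "woman", ["holding", "has_attribute"]))
# → ['woman', 'umbrella', 'red']
```

In the full method, the last node on the reasoning path serves as the basis for the predicted answer; here the path terminates at the attribute node "red".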


