Linguistically Driven Graph Capsule Network for Visual Question Reasoning

by Qingxing Cao, et al.

Recent studies of visual question answering have explored various end-to-end network architectures and achieved promising results on both natural and synthetic datasets that require explicit compositional reasoning. However, it has been argued that these black-box approaches lack interpretability and therefore perform poorly on generalization tasks because they overfit to dataset bias. In this work, we aim to combine the benefits of both sides and overcome their limitations to achieve end-to-end, interpretable structural reasoning for general images without requiring layout annotations. Inspired by the property that a capsule network can carve a tree structure inside a regular convolutional neural network (CNN), we propose a hierarchical compositional reasoning model called the "Linguistically driven Graph Capsule Network", in which the compositional process is guided by the linguistic parse tree. Specifically, we bind each capsule in the lowest layer to the linguistic embedding of a single word in the original question, bridging that word with visual evidence, and then route capsules to the same parent capsule if their words are siblings in the parse tree. This composition is achieved by performing inference on a linguistically driven conditional random field (CRF) and is repeated across multiple graph capsule layers, which yields a compositional reasoning process inside a CNN. Experiments on the CLEVR dataset, the CLEVR compositional generalization test, and the FigureQA dataset demonstrate the effectiveness and compositional generalization ability of our end-to-end model.
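To make the sibling-routing idea concrete, the following is a minimal sketch of one composition step, written under strong simplifying assumptions. It is not the authors' method: the CRF inference described in the abstract is replaced here by plain mean pooling, and the function name `route_by_parse_tree`, the toy parse tree, and the two-dimensional capsule vectors are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def route_by_parse_tree(capsules, parent):
    """Merge capsules whose words are siblings in the parse tree.

    capsules: dict mapping a word to its capsule vector (list of floats).
    parent:   dict mapping a word to its parent word (the root maps to None).

    Assumption: siblings are combined by element-wise averaging. The paper
    describes CRF inference for this step, which this sketch does not implement.
    """
    # Group child capsule vectors by their shared parent node.
    groups = defaultdict(list)
    for word, vec in capsules.items():
        p = parent.get(word)
        if p is not None:
            groups[p].append(vec)

    # Each parent receives the mean of its children's vectors.
    merged = {}
    for p, vecs in groups.items():
        dim = len(vecs[0])
        merged[p] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return merged


# Hypothetical toy tree: "red" and "large" are siblings under "cube".
toy_parent = {"red": "cube", "large": "cube", "cube": "is", "is": None}
toy_capsules = {"red": [1.0, 0.0], "large": [0.0, 1.0], "cube": [1.0, 1.0]}
toy_merged = route_by_parse_tree(toy_capsules, toy_parent)
```

In this toy run, the capsules for "red" and "large" land in the same parent slot ("cube"), while "cube" alone is routed to "is". Applying the function repeatedly, layer by layer, would mirror the bottom-up composition along the parse tree that the abstract describes, again only in this simplified form.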




