Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

by David Mascharka, et al.

Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives' outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art.
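To make the idea of composable, transparent reasoning primitives concrete, here is a minimal illustrative sketch (not the authors' implementation): each primitive consumes image features and emits an attention map over spatial locations, so every intermediate step of the composed program can be inspected directly. The feature layout, function names, and the toy "color channel" convention are assumptions made for illustration only.

```python
import numpy as np

def attend_channel(features, channel):
    """Primitive: attend to regions where one feature channel is strong.
    Returns an attention map normalized to [0, 1]."""
    attn = features[channel]
    return attn / (attn.max() + 1e-8)

def intersect(attn_a, attn_b):
    """Primitive: logical-AND of two attention maps (element-wise min).
    The output is itself a visualizable attention map."""
    return np.minimum(attn_a, attn_b)

def count(attn, threshold=0.5):
    """Primitive: count attended cells above a threshold."""
    return int((attn > threshold).sum())

# Toy 3-channel 4x4 feature map standing in for CNN features.
rng = np.random.default_rng(0)
features = rng.random((3, 4, 4))

# Compose primitives as a small reasoning program, e.g. for a question
# like "how many regions satisfy both attributes?". Each intermediate
# attention map (attr_a, attr_b, both) can be rendered for inspection.
attr_a = attend_channel(features, 0)
attr_b = attend_channel(features, 1)
both = intersect(attr_a, attr_b)
answer = count(both)
```

Because every module's output lives in the same attention-map space, a failure in the final answer can be traced back to the specific primitive whose intermediate map went wrong, which is the diagnostic property the abstract emphasizes.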


