Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool

by Feng Liu, et al.

In recent years, visual question answering (VQA) has become topical. The premise of VQA's significance as a benchmark in AI is that both the image and the textual question need to be well understood and mutually grounded in order to infer the correct answer. However, current VQA models perhaps `understand' less than initially hoped, and instead master the easier task of exploiting cues given away in the question and biases in the answer distribution. In this paper we propose the inverse problem of VQA (iVQA). The iVQA task is to generate a question that corresponds to a given image and answer pair. We propose a variational iVQA model that can generate diverse, grammatically correct and content-correlated questions that match the given answer. Based on this model, we show that iVQA is an interesting benchmark for visuo-linguistic understanding, and a more challenging alternative to VQA, because an iVQA model needs to understand the image better to be successful. As a second contribution, we show how to use iVQA in a novel reinforcement learning framework to diagnose any existing VQA model by exposing its belief set: the set of question-answer pairs that the VQA model would predict true for a given image. This provides a completely new window into what VQA models `believe' about images. We show that existing VQA models hold more erroneous beliefs than previously thought, revealing their intrinsic weaknesses. Suggestions are then made on how to address these weaknesses going forward.
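To make the task direction concrete, here is a minimal toy sketch contrasting VQA with iVQA. All names (`Example`, `vqa`, `ivqa`) and the tiny dataset are hypothetical illustrations, not the paper's implementation: a real iVQA model generates a question from image features and an answer embedding, whereas this sketch simply retrieves matching questions from stored (image, question, answer) triples.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Example:
    """One (image, question, answer) triple, as found in a VQA corpus."""
    image_id: str
    question: str
    answer: str

# A toy stand-in for a VQA dataset (hypothetical data).
EXAMPLES = [
    Example("img_001", "what color is the bus", "red"),
    Example("img_001", "how many people are there", "2"),
    Example("img_002", "what animal is shown", "dog"),
]

def vqa(image_id: str, question: str) -> Optional[str]:
    """Standard VQA direction: (image, question) -> answer."""
    for ex in EXAMPLES:
        if ex.image_id == image_id and ex.question == question:
            return ex.answer
    return None

def ivqa(image_id: str, answer: str) -> List[str]:
    """Inverse direction: (image, answer) -> candidate questions.

    A trained iVQA model would *generate* diverse questions here;
    this toy only retrieves the questions consistent with the answer,
    which is enough to show that the mapping is one-to-many."""
    return [ex.question for ex in EXAMPLES
            if ex.image_id == image_id and ex.answer == answer]
```

The one-to-many nature of the inverse mapping is part of what makes iVQA harder: many questions can share an answer for the same image, so a model must ground the answer in specific image content rather than rely on question cues.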
