Question Type Guided Attention in Visual Question Answering

04/06/2018
by   Yang Shi, et al.

Visual Question Answering (VQA) requires integrating feature maps with drastically different structures and focusing on the correct regions. Image descriptors have structure at multiple spatial scales, while lexical inputs inherently follow a temporal sequence and naturally cluster into semantically different question types. Many previous works use complex models to extract feature representations but neglect high-level summary information, such as question types, during learning. In this work, we propose Question Type-guided Attention (QTA). It uses the question type to dynamically balance between bottom-up and top-down visual features, respectively extracted from ResNet and Faster R-CNN networks. We experiment with multiple VQA architectures, with extensive input ablation studies over the TDIUC dataset, and show that QTA systematically improves performance by more than 5% across multiple question-type categories such as "Activity Recognition", "Utility" and "Counting". By adding QTA to the state-of-the-art model MCB, we achieve a 3% improvement in overall accuracy. Finally, we propose a multi-task extension that predicts question types, which generalizes QTA to applications lacking question-type labels, with minimal performance loss.
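The core idea of QTA, a question-type-conditioned gate that balances the two visual streams before fusion, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the gate shape, dimensions, and plain concatenation fusion are assumptions chosen for clarity (the paper fuses with richer pooling schemes such as MCB).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions, not taken from the paper.
n_question_types = 12   # TDIUC defines 12 question types
d_top_down = 2048       # e.g. pooled ResNet features
d_bottom_up = 2048      # e.g. pooled Faster R-CNN region features

# One learnable 2-way gate per question type; randomly initialized here.
W_gate = rng.normal(size=(n_question_types, 2))

def qta_fuse(q_type, f_top_down, f_bottom_up):
    """Scale each visual stream by a question-type-specific weight,
    then concatenate the two streams into one fused descriptor."""
    gate = softmax(W_gate[q_type])  # (2,) non-negative weights summing to 1
    return np.concatenate([gate[0] * f_top_down,
                           gate[1] * f_bottom_up])

f_res = rng.normal(size=d_top_down)
f_rcnn = rng.normal(size=d_bottom_up)
fused = qta_fuse(q_type=3, f_top_down=f_res, f_bottom_up=f_rcnn)
print(fused.shape)  # (4096,)
```

In training, `W_gate` would be learned jointly with the rest of the network, so each question type (e.g. "Counting") settles on its own balance between the two feature sources.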


