Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks

08/18/2020
by Gouthaman KV, et al.

Attention models are widely used in vision-language (V-L) tasks to perform visual-textual correlation. Humans perform such correlation with a strong linguistic understanding of the visual world. However, even the best-performing attention models in V-L tasks lack this high-level linguistic understanding, creating a semantic gap between the modalities. In this paper, we propose an attention mechanism, Linguistically-aware Attention (LAT), that leverages object attributes obtained from generic object detectors along with pre-trained language models to reduce this semantic gap. LAT represents the visual and textual modalities in a common, linguistically rich space, providing linguistic awareness to the attention process. We apply LAT and demonstrate its effectiveness in three V-L tasks: Counting-VQA, VQA, and image captioning. In Counting-VQA, we propose a novel counting-specific VQA model that predicts an intuitive count and achieves state-of-the-art results on five datasets. In VQA and captioning, we show the generic nature and effectiveness of LAT by adapting it to various baselines and consistently improving their performance.
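The abstract describes the mechanism only at a high level. Below is a minimal sketch of the core idea as stated there: embed the attribute/class words produced by a generic object detector with a pre-trained language model so that image regions and the query live in one shared linguistic space, then compute attention in that space. Everything here is an illustrative assumption, not the paper's implementation; the function name `lat_attention`, the `word_emb` lookup (standing in for a pre-trained language model such as GloVe), the mean pooling, and the dot-product scoring are all hypothetical choices.

```python
import numpy as np

# Hypothetical sketch of linguistically-aware attention.
# Assumed inputs (not from the paper's code):
#   region_feats:   (R, Dv) visual features for R detected regions
#   region_words:   length-R list of attribute/class word lists per region,
#                   as produced by a generic object detector
#   question_words: list of words in the textual query
#   word_emb:       dict mapping a word to its pre-trained embedding (De,)
#                   (OOV handling omitted for brevity)

def lat_attention(region_feats, region_words, question_words, word_emb):
    # Linguistic representation of each region: mean of its word embeddings,
    # placing the visual side in the pre-trained linguistic space.
    ling = np.stack([
        np.mean([word_emb[w] for w in ws], axis=0) for ws in region_words
    ])                                                        # (R, De)

    # Linguistic representation of the query in the same space.
    query = np.mean([word_emb[w] for w in question_words], axis=0)  # (De,)

    # Attention scores are computed in the shared linguistic space,
    # so regions are matched to the text with linguistic awareness.
    scores = ling @ query                                     # (R,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax over regions

    # The linguistically derived weights gate the *visual* features.
    return weights @ region_feats                             # (Dv,)
```

The point of the sketch is the design choice the abstract emphasizes: attention weights are derived from representations in a common linguistic space rather than from raw visual features, which is what supplies the "linguistic awareness" to the attention process.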

Related research

11/12/2017
High-Order Attention Models for Visual Question Answering
The quest for algorithms that enable cognitive abilities is an important...

11/15/2022
PromptCap: Prompt-Guided Task-Aware Image Captioning
Image captioning aims to describe an image with a natural language sente...

07/25/2017
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Top-down visual attention mechanisms have been used extensively in image...

02/11/2019
Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded
Many vision and language models suffer from poor visual grounding - ofte...

10/05/2020
Attention Guided Semantic Relationship Parsing for Visual Question Answering
Humans explain inter-object relationships with semantic labels that demo...

03/23/2023
Top-Down Visual Attention from Analysis by Synthesis
Current attention algorithms (e.g., self-attention) are stimulus-driven ...

07/09/2019
Learning by Abstraction: The Neural State Machine
We introduce the Neural State Machine, seeking to bridge the gap between...
