A Synchronized Multi-Modal Attention-Caption Dataset and Analysis

03/06/2019
by Sen He, et al.

In this work, we present a novel multi-modal dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences between human attention in free-viewing and image captioning tasks, examine the relationship between human attention and language constructs during perception and sentence articulation, and compare human and machine attention in captioning, in particular the top-down soft attention approach that is argued to mimic human attention. Our study reveals that: (1) human attention behaviour in free-viewing differs from that in image description, as humans fixate on a greater variety of regions under the latter task; (2) there is a strong relationship between the described objects and the objects attended by subjects (97% of described objects are attended); (3) a convolutional neural network used as a feature encoder captures the regions humans attend to under image captioning to a great extent (around 78%); (4) soft attention as a top-down mechanism agrees with human attention neither spatially nor temporally; and (5) soft attention does not add strongly beneficial, human-like attention behaviour to the captioning task, as the correlation between caption scores and attention-consistency scores is low, indicating a large gap between human and machine top-down attention.
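The top-down soft attention compared against human attention above can be illustrated with a minimal sketch. This is not the authors' code: the function name, the additive scoring form, and all parameter shapes are assumptions, following the common soft-attention formulation in which the caption decoder's hidden state scores each spatial region of the CNN feature map and a softmax turns the scores into attention weights.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, w_a):
    """Hypothetical top-down soft-attention step (not the paper's code).

    features: (K, D) CNN encoder features, one row per spatial region
    hidden:   (H,)   decoder hidden state at the current time step
    W_f, W_h, w_a:   assumed learned projections (shapes (D, A), (H, A), (A,))
    """
    # Additive score for each region, combining image features and decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w_a   # shape (K,)
    # Softmax over regions -> attention weights that sum to 1.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Context vector: attention-weighted sum of region features.
    context = alpha @ features                              # shape (D,)
    return alpha, context

# Toy usage with random parameters; K = 196 mimics a 14x14 feature map.
rng = np.random.default_rng(0)
K, D, H, A = 196, 512, 256, 128
alpha, context = soft_attention(rng.normal(size=(K, D)),
                                rng.normal(size=(H,)),
                                rng.normal(size=(D, A)) * 0.01,
                                rng.normal(size=(H, A)) * 0.01,
                                rng.normal(size=(A,)))
```

The per-step weight vector `alpha` is the machine attention map that studies like this one compare, spatially and temporally, against recorded human fixations.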


Related research:

- Paying Attention to Descriptions Generated by Image Captioning Models (04/24/2017): To bridge the gap between humans and machines in image understanding and...
- Contextualized Keyword Representations for Multi-modal Retinal Image Captioning (04/26/2021): Medical image captioning automatically generates a medical description t...
- X-Linear Attention Networks for Image Captioning (03/31/2020): Recent progress on fine-grained visual recognition and visual question a...
- Language-Driven Region Pointer Advancement for Controllable Image Captioning (11/30/2020): Controllable Image Captioning is a recent sub-field in the multi-modal t...
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph (07/26/2021): Entity-aware image captioning aims to describe named entities and events...
- Attention Correctness in Neural Image Captioning (05/31/2016): Attention mechanisms have recently been introduced in deep learning for ...
- Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze (11/09/2020): When speakers describe an image, they tend to look at objects before men...
