Context-Aware Visual Policy Network for Fine-Grained Image Captioning

06/06/2019
by Zheng-Jun Zha, et al.

With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained, and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer, and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be formulated as sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy and neglect the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning, such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly treats the previous visual attentions as context and decides whether that context should be used for the current word/sentence generation, given the current visual attention. Compared with the traditional visual attention mechanism, which fixates on only a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model -- CAVP and its subsequent language policy network -- can be efficiently optimized end to end with an actor-critic policy gradient method. We demonstrate the effectiveness of CAVP through state-of-the-art performance on the MS-COCO and Stanford captioning datasets, across various metrics and with sensible visualizations of qualitative visual context.
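To make the idea above concrete, the following PyTorch sketch shows what a context-aware visual policy step could look like: at each decoding step it attends over both the current region features and the visual contexts accumulated from previous steps, with a gate deciding how much of that history to use. All layer names, dimensions, and the gating scheme here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareVisualPolicy(nn.Module):
    """Minimal sketch of a CAVP-style attention step (hypothetical
    layer names and sizes; not the paper's exact architecture).

    Attends over (a) the current detected region features and
    (b) the visual contexts attended at previous steps, so the
    policy can compose relations such as "man riding horse"
    instead of fixating on a single region.
    """

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.region_att = nn.Linear(feat_dim + hidden_dim, 1)   # scores current regions
        self.context_att = nn.Linear(feat_dim + hidden_dim, 1)  # scores past contexts
        self.gate = nn.Linear(hidden_dim, 1)                    # use the context or not?
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, regions, contexts, h):
        # regions:  (B, R, feat_dim) region features from the detector
        # contexts: (B, T, feat_dim) visual contexts from previous steps
        # h:        (B, hidden_dim)  current language-decoder hidden state
        def attend(feats, scorer):
            q = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            scores = scorer(torch.cat([feats, q], dim=-1)).squeeze(-1)
            w = F.softmax(scores, dim=1)
            return torch.bmm(w.unsqueeze(1), feats).squeeze(1)

        v_now = attend(regions, self.region_att)    # single-step visual attention
        v_ctx = attend(contexts, self.context_att)  # attention over visual history
        g = torch.sigmoid(self.gate(h))             # soft decision: rely on context?
        fused = self.fuse(torch.cat([v_now, g * v_ctx], dim=-1))
        return fused, v_now                         # v_now is appended to the history


if __name__ == "__main__":
    # Usage sketch: B=2 images, R=36 regions, T=5 previous steps.
    regions = torch.randn(2, 36, 2048)
    contexts = torch.randn(2, 5, 2048)
    h = torch.randn(2, 512)
    policy = ContextAwareVisualPolicy()
    fused, v_now = policy(regions, contexts, h)  # fused: (2, 2048)
```

In a full model, `fused` would feed the language policy network that predicts the next word, and `v_now` would be appended to the running context memory before the next step; training both policies jointly with an actor-critic policy gradient, as the abstract describes, would then reward entire generated sequences rather than single words.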


Related research

08/16/2018
Context-Aware Visual Policy Network for Sequence-Level Image Captioning
Many vision-language tasks can be reduced to the problem of sequence pre...

04/21/2020
ParaCNN: Visual Paragraph Generation via Adversarial Twin Contextual CNNs
Image description generation plays an important role in many real-world ...

03/28/2016
Generating Visual Explanations
Clearly explaining a rationale for a classification decision to an end-u...

08/02/2021
Distributed Attention for Grounded Image Captioning
We study the problem of weakly supervised grounded image captioning. Tha...

05/15/2019
Aligning Visual Regions and Textual Concepts: Learning Fine-Grained Image Representations for Image Captioning
In image-grounded text generation, fine-grained representations of the i...

10/14/2019
Tell-the-difference: Fine-grained Visual Descriptor via a Discriminating Referee
In this paper, we investigate a novel problem of telling the difference ...

11/14/2015
Oracle performance for visual captioning
The task of associating images and videos with a natural language descri...
