Cross-Domain Image Captioning with Discriminative Finetuning

04/04/2023
by Roberto Dessì, et al.

Neural captioners are typically trained to mimic human-generated references without optimizing for any specific communication goal, leading to problems such as the generation of vague captions. In this paper, we show that fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify that image among a set of candidates. We experiment with the popular ClipCap captioner, also replicating the main results with BLIP. In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative fine-tuning lag slightly behind those generated by the non-fine-tuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, our discriminatively fine-tuned captioner generates descriptions that resemble human references more closely than those produced by the same captioner without fine-tuning. We further show that, on the Conceptual Captions dataset, discriminatively fine-tuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators performing an image discrimination task.
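The discriminative objective described above can be sketched as a retrieval loss: given the embedding of a generated caption and the embeddings of a set of candidate images, the captioner is rewarded when a text-conditioned retriever assigns high similarity to the target image. The sketch below is illustrative only; the function name, the temperature value, and the use of cosine-similarity logits are assumptions (the paper uses a CLIP-based retriever, whose exact scoring we do not reproduce here).

```python
import numpy as np

def discriminative_retrieval_loss(caption_emb, image_embs, target_idx, temperature=0.07):
    """Cross-entropy loss for identifying the target image among candidates,
    given one caption embedding and a matrix of candidate image embeddings
    (one row per candidate). Names and temperature are illustrative, not
    taken from the paper's code."""
    # Cosine similarity between the caption and each candidate image.
    c = caption_emb / np.linalg.norm(caption_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    logits = imgs @ c / temperature
    # Stable softmax over candidates; the loss is the negative
    # log-probability assigned to the target image.
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_idx])

# Toy usage: a caption aligned with the target image yields a lower loss
# than the same caption scored against the wrong target.
caption = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_match = discriminative_retrieval_loss(caption, candidates, target_idx=0)
loss_mismatch = discriminative_retrieval_loss(caption, candidates, target_idx=1)
```

Minimizing this loss with respect to the captioner's parameters (e.g., via REINFORCE or straight-through gradients over the generated tokens) pushes the model toward captions that discriminate the target image from the distractors, which is the self-supervised signal the paper fine-tunes with.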

