PromptCap: Prompt-Guided Task-Aware Image Captioning

11/15/2022
by Yushi Hu, et al.

Image captioning aims to describe an image with a natural language sentence, allowing powerful language models to understand images. The framework of combining image captioning with language models has been successful on various vision-language tasks. However, an image contains much more information than a single sentence can convey, which leaves it underspecified which visual entities the caption should describe. For example, when performing visual question answering (VQA), generic image captions often miss visual details that are essential for the language model to answer correctly. To address this challenge, we propose PromptCap, a captioning model that takes a natural-language prompt to control the contents of the generated caption. The prompt contains a question that the caption should help to answer, and it can also carry auxiliary text inputs such as scene text within the image itself. To finetune a general image caption model for prompt-guided captioning, we propose a pipeline to synthesize and filter training examples with GPT-3 and existing VQA datasets. For evaluation, we start with an existing pipeline in which a language model is prompted with image captions to carry out VQA. With the same language model, a higher QA accuracy shows that our generated captions are more relevant to the question prompts. PromptCap outperforms generic captions by a large margin on a variety of VQA tasks and achieves the state-of-the-art accuracy of 58.8 [...] experiments on WebQA show that PromptCap generalizes well to unseen domains.
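
The caption-then-answer evaluation described in the abstract can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the functions `build_caption_prompt`, `build_qa_prompt`, and `answer_question`, the prompt wording, and the `captioner` and `language_model` callables are hypothetical stand-ins, not the authors' released code or prompts.

```python
from typing import Any, Callable, Optional


def build_caption_prompt(question: str, aux_text: Optional[str] = None) -> str:
    """Compose the natural-language prompt for the captioner.

    The wording is hypothetical; the abstract only says the prompt holds a
    question and may also carry auxiliary text such as scene text.
    """
    prompt = f"Describe this image according to the question: {question.strip()}"
    if aux_text:
        prompt += f" Text appearing in the image: {aux_text.strip()}"
    return prompt


def build_qa_prompt(caption: str, question: str) -> str:
    """Turn a caption and a question into a text-only prompt for a language model."""
    return f"Context: {caption.strip()}\nQuestion: {question.strip()}\nAnswer:"


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],   # assumed: (image, prompt) -> caption
    language_model: Callable[[str], str],   # assumed: prompt -> completion
    aux_text: Optional[str] = None,
) -> str:
    """Caption the image with the question in view, then answer from the caption alone."""
    caption = captioner(image, build_caption_prompt(question, aux_text))
    return language_model(build_qa_prompt(caption, question)).strip()


if __name__ == "__main__":
    # Stub models so the sketch runs without any real captioner or LM.
    def fake_captioner(image: Any, prompt: str) -> str:
        return "a boy holding a red umbrella"

    def fake_lm(prompt: str) -> str:
        return " umbrella"

    print(answer_question(None, "What is the boy holding?", fake_captioner, fake_lm))
```

Because the language model sees only the caption, holding it fixed and swapping the captioner isolates how much question-relevant detail each caption carries, which is the comparison the abstract describes.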

research
06/03/2019

Generating Question Relevant Captions to Aid Visual Question Answering

Visual question answering (VQA) and image captioning require a shared bo...
research
11/17/2020

Structural and Functional Decomposition for Personality Image Captioning in a Communication Game

Personality image captioning (PIC) aims to describe an image with a natu...
research
11/07/2021

Machine-in-the-Loop Rewriting for Creative Image Captioning

Machine-in-the-loop writing aims to enable humans to collaborate with mo...
research
08/18/2020

Linguistically-aware Attention for Reducing the Semantic-Gap in Vision-Language Tasks

Attention models are widely used in Vision-language (V-L) tasks to perfo...
research
10/28/2020

Fusion Models for Improved Visual Captioning

Visual captioning aims to generate textual descriptions given images. Tr...
research
11/09/2020

CapWAP: Captioning with a Purpose

The traditional image captioning task uses generic reference captions to...
research
06/05/2023

Cheap-fake Detection with LLM using Prompt Engineering

The misuse of real photographs with conflicting image captions in news i...
