Explicit Image Caption Editing

by Zhen Wang, et al.

Given an image and a reference caption, the image caption editing task aims to correct the misalignment errors and generate a refined caption. However, all existing caption editing works are implicit models, i.e., they directly produce the refined captions without explicit connections to the reference captions. In this paper, we introduce a new task: Explicit Caption Editing (ECE). ECE models explicitly generate a sequence of edit operations, and this edit operation sequence translates the reference caption into a refined one. Compared to implicit editing, ECE has multiple advantages: 1) Explainable: it can trace the whole editing path. 2) Editing Efficient: it only needs to modify a few words. 3) Human-like: it resembles the way that humans perform caption editing, and tries to keep original sentence structures. To solve this new task, we propose the first ECE model: TIger. TIger is a non-autoregressive transformer-based model, consisting of three modules: Tagger_del, Tagger_add, and Inserter. Specifically, Tagger_del decides whether each word should be preserved or not, Tagger_add decides where to add new words, and Inserter predicts the specific word to add at each of those positions. To further facilitate ECE research, we propose two new ECE benchmarks by re-organizing two existing datasets, dubbed COCO-EE and Flickr30K-EE, respectively. Extensive ablations on both benchmarks demonstrate the effectiveness of TIger.
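To make the idea of an explicit edit sequence concrete, the following is a minimal sketch of how keep/delete tags, add positions, and inserted words could turn a reference caption into a refined one. It is not the TIger model: the tags and words are supplied by hand rather than predicted from an image, and the function names (`apply_deletions`, `apply_additions`, `edit_caption`), the tag strings, and the "insert after the flagged word" convention are all assumptions made for illustration.

```python
from typing import List

KEEP, DELETE = "KEEP", "DELETE"  # hypothetical tag names for this sketch


def apply_deletions(words: List[str], del_tags: List[str]) -> List[str]:
    """Drop every word tagged DELETE (the role the abstract gives Tagger_del)."""
    if len(words) != len(del_tags):
        raise ValueError("need exactly one tag per word")
    return [w for w, t in zip(words, del_tags) if t == KEEP]


def apply_additions(words: List[str], add_flags: List[bool],
                    new_words: List[str]) -> List[str]:
    """Insert one new word after each flagged position.

    add_flags stands in for Tagger_add (where to add) and new_words for
    Inserter (what to add). Inserting *after* the flagged word is an
    assumption of this sketch.
    """
    if len(words) != len(add_flags):
        raise ValueError("need exactly one flag per word")
    if sum(add_flags) != len(new_words):
        raise ValueError("need exactly one new word per flagged position")
    out: List[str] = []
    fill = iter(new_words)
    for w, flag in zip(words, add_flags):
        out.append(w)
        if flag:
            out.append(next(fill))
    return out


def edit_caption(reference: str, del_tags: List[str],
                 add_flags: List[bool], new_words: List[str]) -> str:
    """Apply deletions, then additions; the two steps form the edit path."""
    kept = apply_deletions(reference.split(), del_tags)
    return " ".join(apply_additions(kept, add_flags, new_words))


refined = edit_caption(
    "a dog sitting on a couch",
    del_tags=[KEEP, DELETE, KEEP, KEEP, KEEP, DELETE],
    add_flags=[True, False, False, True],  # flags refer to the kept words
    new_words=["cat", "bed"],
)
print(refined)  # a cat sitting on a bed
```

Because every step is an explicit operation on the reference caption, the intermediate result after deletion ("a sitting on a") and the positions of the inserted words can be inspected directly, which is the kind of traceable editing path the abstract describes.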




