Cross-Modal Similarity-Based Curriculum Learning for Image Captioning

12/14/2022
by Hongkuan Zhang, et al.

Image captioning models require high-level generalization ability to describe the contents of diverse images in words. Most existing approaches treat all image-caption pairs equally during training, without considering differences in their learning difficulty. Several image captioning approaches introduce curriculum learning, which presents training data in order of increasing difficulty. However, their difficulty measurements rely on either domain-specific features or prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning based on cross-modal similarity calculated by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that the proposed approach outperforms the baselines and converges at a competitive speed, without requiring heuristics or incurring additional training costs. Moreover, its higher performance on difficult examples and unseen data demonstrates its generalization ability.
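The idea of a similarity-based curriculum can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that image and caption embeddings have already been produced by some pretrained vision-language model, that higher cosine similarity marks an easier pair, and that a simple linear pacing schedule is used. The function names (`order_by_difficulty`, `available_fraction`, `sample_batch`) and the `start` parameter are hypothetical and chosen only for illustration.

```python
import math
import random
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def order_by_difficulty(
    image_embs: Sequence[Sequence[float]],
    caption_embs: Sequence[Sequence[float]],
) -> List[int]:
    """Return pair indices sorted from easiest to hardest.

    Assumption: a higher image-caption similarity means an easier pair.
    """
    sims = [cosine_similarity(i, c) for i, c in zip(image_embs, caption_embs)]
    return sorted(range(len(sims)), key=lambda k: sims[k], reverse=True)


def available_fraction(step: int, total_steps: int, start: float = 0.2) -> float:
    """Hypothetical linear pacing: share of the sorted data unlocked at `step`."""
    return min(1.0, start + (1.0 - start) * step / total_steps)


def sample_batch(
    order: Sequence[int],
    step: int,
    total_steps: int,
    batch_size: int,
    rng: random.Random,
) -> List[int]:
    """Sample a batch uniformly from the easiest pairs unlocked so far."""
    fraction = available_fraction(step, total_steps)
    cutoff = max(batch_size, math.ceil(fraction * len(order)))
    pool = list(order[:cutoff])
    return rng.sample(pool, min(batch_size, len(pool)))
```

Because the difficulty scores come from a frozen pretrained model, they can be computed once before training, which is consistent with the abstract's claim of no additional training cost. The pacing schedule here is only one possible choice.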

research
11/14/2022

Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment

CLIP (Contrastive Language-Image Pre-Training) has shown remarkable zero...
research
02/28/2020

Exploring and Distilling Cross-Modal Information for Image Captioning

Recently, attention-based encoder-decoder models have been used extensiv...
research
12/18/2022

Efficient Image Captioning for Edge Devices

Recent years have witnessed the rapid progress of image captioning. Howe...
research
08/05/2023

Improving Generalization of Image Captioning with Unsupervised Prompt Learning

Pretrained visual-language models have demonstrated impressive zero-shot...
research
05/20/2023

Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment

Unpaired cross-lingual image captioning has long suffered from irrelevan...
research
10/22/2021

Exploiting Cross-Modal Prediction and Relation Consistency for Semi-Supervised Image Captioning

The task of image captioning aims to generate captions directly from ima...
research
06/24/2022

Competence-based Multimodal Curriculum Learning for Medical Report Generation

Medical report generation task, which targets to produce long and cohere...
