CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge

by   Linli Yao, et al.

Automatically generating textual descriptions for massive unlabeled images on the web can greatly benefit realistic web applications, e.g. multimodal retrieval and recommendation. However, existing models suffer from the problem of generating “over-generic” descriptions, such as their tendency to generate repetitive sentences with common concepts for different images. These generic descriptions fail to provide sufficient textual semantics for ever-changing web images. Inspired by the recent success of Vision-Language Pre-training (VLP) models that learn diverse image-text concept alignment during pretraining, we explore leveraging their cross-modal pre-trained knowledge to automatically enrich the textual semantics of image descriptions. With no need for additional human annotations, we propose a plug-and-play framework, i.e CapEnrich, to complement the generic image descriptions with more semantic details. Specifically, we first propose an automatic data-building strategy to get desired training sentences, based on which we then adopt prompting strategies, i.e. learnable and template prompts, to incentivize VLP models to generate more textual details. For learnable templates, we fix the whole VLP model and only tune the prompt vectors, which leads to two advantages: 1) the pre-training knowledge of VLP models can be reserved as much as possible to describe diverse visual concepts; 2) only lightweight trainable parameters are required, so it is friendly to low data resources. Extensive experiments show that our method significantly improves the descriptiveness and diversity of generated sentences for web images. The code is available at


page 1

page 4

page 5

page 8

page 11


Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Despite the recent developments in the field of cross-modal retrieval, t...

Image Aesthetics Assessment via Learnable Queries

Image aesthetics assessment (IAA) aims to estimate the aesthetics of ima...

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

Recent Vision-Language Pre-trained (VLP) models based on dual encoder ha...

LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Multimodal pre-training has propelled great advancement in vision-and-la...

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

The pre-trained image-text models, like CLIP, have demonstrated the stro...

VLG: General Video Recognition with Web Textual Knowledge

Video recognition in an open and dynamic world is quite challenging, as ...

Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

Current captioning approaches tend to generate correct but "generic" des...

Please sign up or login with your details

Forgot password? Click here to reset