Rethinking Benchmarks for Cross-modal Image-text Retrieval

by Weijing Chen, et al.

Image-text retrieval, a fundamental branch of information retrieval, has attracted extensive research attention. The main challenge of the task is cross-modal semantic understanding and matching, and some recent works focus on fine-grained cross-modal semantic matching. With the prevalence of large-scale multimodal pretraining models, several state-of-the-art models (e.g. X-VLM) have achieved near-perfect performance on the widely used image-text retrieval benchmarks, i.e. MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review these two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching, because a large number of the images and texts in the benchmarks are coarse-grained. Based on this observation, we renovate the coarse-grained images and texts in the old benchmarks and establish improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adopting more similar images. On the text side, we propose a novel semi-automatic renovation approach that refines coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models on fine-grained semantic comprehension through extensive experiments. The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding, especially in distinguishing attributes of close objects in images. Our code and improved benchmark datasets are publicly available, and we hope they will inspire further in-depth research on cross-modal retrieval.
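
To make the image-side change concrete, the following is a minimal sketch, not the authors' code, of why enlarging the image pool with similar images makes a retrieval benchmark harder. It assumes the usual Recall@K protocol for text-to-image retrieval; the function name `recall_at_k` and all similarity scores are hypothetical and chosen only for illustration.

```python
import numpy as np

def recall_at_k(sim, gt, k):
    """Fraction of text queries whose matching image is ranked in the top k.

    sim: (num_texts, num_images) similarity scores.
    gt:  gt[i] is the column index of the image that matches text i.
    """
    # Rank of the ground-truth image = number of images scoring strictly higher.
    gt_scores = sim[np.arange(sim.shape[0]), gt]
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    return float((ranks < k).mean())

# Hypothetical scores: 3 text queries against an original pool of 3 images.
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.3, 0.7],
])
gt = np.array([0, 1, 2])

# Hypothetical scores for two extra, visually similar images added to the pool.
distractors = np.array([
    [0.95, 0.10],
    [0.20, 0.50],
    [0.10, 0.75],
])
sim_enlarged = np.concatenate([sim, distractors], axis=1)

r1_original = recall_at_k(sim, gt, k=1)
r1_enlarged = recall_at_k(sim_enlarged, gt, k=1)
print(f"R@1 original pool: {r1_original:.2f}")  # 1.00
print(f"R@1 enlarged pool: {r1_enlarged:.2f}")  # 0.33
```

Because the ground-truth columns are unchanged, adding images can only keep or lower each query's rank, so a model that cannot separate the similar images from the true match loses recall on the enlarged pool.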




