Mr. Right: Multimodal Retrieval on Representation of ImaGe witH Text

by   Cheng-An Hsieh, et al.

Multimodal learning is a recent challenge that extends unimodal learning by generalizing to diverse modalities, such as text, images, or speech. This extension requires models to process and relate information across multiple modalities. In Information Retrieval, traditional retrieval tasks focus on the similarity between unimodal documents and queries, while image-text retrieval assumes that most texts describe the scene depicted in an image. This separation ignores the fact that real-world queries may involve text content, image captions, or both. To address this, we introduce Multimodal Retrieval on Representation of ImaGe witH Text (Mr. Right), a novel and comprehensive dataset for multimodal retrieval. We build on the Wikipedia dataset, which offers rich text-image examples, and generate three types of text-based queries carrying different modality information: text-related, image-related, and mixed. To validate the effectiveness of our dataset, we provide a multimodal training paradigm and evaluate previous text-retrieval and image-retrieval frameworks. The results show that the proposed multimodal retrieval improves retrieval performance, but creating a well-unified document representation from texts and images remains a challenge. We hope Mr. Right helps broaden current retrieval systems and contributes to accelerating the advancement of multimodal learning in Information Retrieval.
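The core retrieval setting described above can be illustrated with a minimal sketch: each document gets a single fused representation built from a text embedding and an image embedding, and text queries are ranked against it by cosine similarity. The random vectors below are stand-ins for real encoder outputs, and the fusion-by-averaging step is one simple illustrative choice, not the training paradigm proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding dimension (stand-in for a real encoder's output size)

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)

def fuse(text_emb, image_emb):
    """Simple late fusion: average the two unit-normalized modality embeddings."""
    return normalize(normalize(text_emb) + normalize(image_emb))

# Three documents, each with a (stand-in) text embedding and image embedding,
# fused into one unified document representation.
docs = [fuse(rng.normal(size=DIM), rng.normal(size=DIM)) for _ in range(3)]

def retrieve(query_emb, doc_embs):
    """Rank document indices by cosine similarity to a text query embedding."""
    q = normalize(query_emb)
    scores = [float(q @ d) for d in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: -scores[i])

ranking = retrieve(rng.normal(size=DIM), docs)
print(ranking)  # document indices, best match first
```

Whether a text-related, image-related, or mixed query matches well then depends on how much of each modality's information survives the fusion step, which is exactly the representation challenge the abstract highlights.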




