Zero-shot Composed Text-Image Retrieval

06/12/2023
by   Yikun Liu, et al.
2

In this paper, we consider the problem of composed image retrieval (CIR), it aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability. We make the following contributions: (i) we initiate a scalable pipeline to automatically construct datasets for training CIR model, by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model, TransAgg, which employs a simple yet efficient fusion mechanism, to adaptively combine information from diverse modalities; (iii) we conduct extensive ablation studies to investigate the usefulness of our proposed data construction procedure, and the effectiveness of core components in TransAgg; (iv) when evaluating on the publicly available benckmarks under the zero-shot scenario, i.e., training on the automatically constructed datasets, then directly conduct inference on target downstream datasets, e.g., CIRR and FashionIQ, our proposed approach either performs on par with or significantly outperforms the existing state-of-the-art (SOTA) models. Project page: https://code-kunkun.github.io/ZS-CIR/

READ FULL TEXT

page 4

page 5

page 10

page 16

research
03/27/2023

Zero-Shot Composed Image Retrieval with Textual Inversion

Composed Image Retrieval (CIR) aims to retrieve a target image based on ...
research
02/06/2023

Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval

In Composed Image Retrieval (CIR), a user combines a query image with te...
research
03/21/2023

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

This paper proposes a novel diffusion-based model, CompoDiff, for solvin...
research
08/28/2023

CoVR: Learning Composed Video Retrieval from Web Video Captions

Composed Image Retrieval (CoIR) has recently gained popularity as a task...
research
07/23/2020

ZSCRGAN: A GAN-based Expectation Maximization Model for Zero-Shot Retrieval of Images from Textual Descriptions

Most existing algorithms for cross-modal Information Retrieval are based...
research
06/13/2023

GeneCIS: A Benchmark for General Conditional Image Similarity

We argue that there are many notions of 'similarity' and that models, li...
research
09/04/2023

NLLB-CLIP – train performant multilingual image retrieval model on a budget

Today, the exponential rise of large models developed by academic and in...

Please sign up or login with your details

Forgot password? Click here to reset