CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Structure for Vision-Language Retrieval

04/15/2023
by Yang Yang, et al.

Current vision-language retrieval aims to perform cross-modal instance search, and its core idea is to learn consistent vision-language representations. Although cross-modal retrieval performance has greatly improved with the development of deep models, we find that enforcing traditional hard consistency may destroy the original relationships among single-modal instances, degrading single-modal retrieval performance. To understand this challenge, we experimentally observe that the vision-language divergence gives rise to strong and weak modalities, and hard cross-modal consistency cannot guarantee that the relationships among strong-modality instances remain unaffected by the weak modality; as a result, these relationships are perturbed even though consistent representations are learned. To this end, we propose a novel Coordinated Vision-Language Retrieval method (dubbed CoVLR), which aims to study and alleviate the desynchrony between the cross-modal alignment and single-modal cluster-preserving tasks. CoVLR addresses this challenge with a meta-optimization based strategy, in which the cross-modal consistency objective and the intra-modal relation-preserving objective serve as the meta-train and meta-test tasks, respectively, so that both are optimized in a coordinated way. Consequently, CoVLR simultaneously ensures cross-modal consistency and preserves intra-modal structure. Experiments on different datasets validate that CoVLR improves single-modal retrieval accuracy while preserving cross-modal retrieval capacity compared with the baselines.
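To make the coordination idea concrete, below is a minimal, hypothetical PyTorch sketch of a meta-train/meta-test update in the spirit of the strategy described above: the cross-modal alignment loss drives a virtual inner-loop step, and the intra-modal structure-preserving loss is evaluated on the virtually updated parameters so that its gradient also constrains the alignment update. All names here (TwoTowerHead, alignment_loss, structure_loss, inner_lr, the reference features) are illustrative assumptions, not the authors' implementation; InfoNCE and similarity-matrix losses are used only as common stand-ins for the two objectives.

```python
# Hypothetical sketch: coordinate a cross-modal consistency objective (meta-train)
# with an intra-modal structure-preserving objective (meta-test).
import torch
import torch.nn.functional as F
from torch.func import functional_call


class TwoTowerHead(torch.nn.Module):
    """Toy projection heads standing in for the vision/language encoders."""
    def __init__(self, dim_img=512, dim_txt=512, dim_out=256):
        super().__init__()
        self.proj_img = torch.nn.Linear(dim_img, dim_out)
        self.proj_txt = torch.nn.Linear(dim_txt, dim_out)

    def forward(self, img_feat, txt_feat):
        return self.proj_img(img_feat), self.proj_txt(txt_feat)


def alignment_loss(z_img, z_txt, tau=0.07):
    """Symmetric InfoNCE as a stand-in for 'hard' cross-modal consistency."""
    z_img, z_txt = F.normalize(z_img, dim=-1), F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.t() / tau
    target = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))


def structure_loss(z, z_ref):
    """Keep within-modality similarity structure close to reference features."""
    sim = F.normalize(z, dim=-1) @ F.normalize(z, dim=-1).t()
    sim_ref = F.normalize(z_ref, dim=-1) @ F.normalize(z_ref, dim=-1).t()
    return (sim - sim_ref).pow(2).mean()


def coordinated_step(model, opt, img_feat, txt_feat, img_ref, txt_ref, inner_lr=1e-2):
    # Meta-train: virtual SGD step on the cross-modal consistency loss.
    z_img, z_txt = model(img_feat, txt_feat)
    l_align = alignment_loss(z_img, z_txt)
    params = dict(model.named_parameters())
    grads = torch.autograd.grad(l_align, tuple(params.values()), create_graph=True)
    fast = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}

    # Meta-test: evaluate intra-modal structure preservation with the virtually
    # updated parameters, so its gradient flows through the alignment update and
    # the two objectives are optimized in a coordinated way.
    z_img_f, z_txt_f = functional_call(model, fast, (img_feat, txt_feat))
    l_struct = structure_loss(z_img_f, img_ref) + structure_loss(z_txt_f, txt_ref)

    opt.zero_grad()
    (l_align + l_struct).backward()
    opt.step()
    return l_align.item(), l_struct.item()


# Example usage with random features standing in for precomputed encoder outputs.
model = TwoTowerHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
img, txt = torch.randn(32, 512), torch.randn(32, 512)
coordinated_step(model, opt, img, txt, img_ref=img, txt_ref=txt)
```

In this sketch, img_ref and txt_ref stand in for features from a frozen single-modal reference whose neighbourhood structure should be preserved; the paper's actual objectives and optimization details may differ.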

research 09/09/2021
Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
Pretrained vision-and-language BERTs aim to learn representations that c...

research 12/19/2016
Cross-Modal Manifold Learning for Cross-modal Retrieval
This paper presents a new scalable algorithm for cross-modal similarity ...

research 05/11/2023
Continual Vision-Language Representation Learning with Off-Diagonal Information
This paper discusses the feasibility of continuously training the CLIP m...

research 08/28/2023
Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions
With the exponential surge in diverse multi-modal data, traditional uni-...

research 04/13/2023
Noisy Correspondence Learning with Meta Similarity Correction
Despite the success of multimodal learning in cross-modal retrieval task...

research 11/06/2019
A coupled autoencoder approach for multi-modal analysis of cell types
Recent developments in high throughput profiling of individual neurons h...

research 04/17/2019
Adversarial Cross-Modal Retrieval via Learning and Transferring Single-Modal Similarities
Cross-modal retrieval aims to retrieve relevant data across different mo...
