Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval

09/11/2023
by Yabing Wang, et al.

Current research on cross-modal retrieval is mostly English-oriented, owing to the availability of a large number of English-oriented human-labeled vision-language corpora. To overcome the scarcity of labeled non-English data, cross-lingual cross-modal retrieval (CCR) has attracted increasing attention. Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation (MT) to achieve cross-lingual transfer. However, the sentences produced by MT generally describe the corresponding visual content imperfectly. Improperly assuming that the pseudo-parallel data are correctly correlated makes the networks overfit to the noisy correspondence. We therefore propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR. In particular, we quantify the confidence of the sample-pair correlation with optimal transport theory from both the cross-lingual and cross-modal views, and design dual-view curriculum learning to dynamically model the transportation costs according to the learning stage of the two views. Extensive experiments are conducted on two multilingual image-text datasets and one video-text dataset, and the results demonstrate the effectiveness and robustness of the proposed method. In addition, the proposed method extends well to cross-lingual image-text baselines and generalizes decently to out-of-domain data.
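
To make the idea of scoring pair confidence with optimal transport more concrete, here is a minimal sketch in Python with NumPy. It is not the DCOT method itself: the abstract does not specify the cost function, the marginals, or the curriculum schedule, so the cosine-distance cost, the uniform marginals, the entropic (Sinkhorn) solver, and the names `sinkhorn_plan` and `pair_confidence` are all assumptions chosen for illustration. It covers a single view only (for example, image to translated caption).

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals.

    Assumption: a generic Sinkhorn solver, not the solver used in the paper.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform mass on the rows (e.g. images)
    b = np.full(m, 1.0 / m)      # uniform mass on the columns (e.g. captions)
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def pair_confidence(src_emb, tgt_emb, eps=0.1):
    """Hypothetical confidence score for each nominal pair (i, i).

    Returns roughly the fraction of sample i's mass that the plan sends
    to its own nominal partner.
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cost = 1.0 - src @ tgt.T     # cosine distance as transport cost (assumed)
    plan = sinkhorn_plan(cost, eps)
    return np.diag(plan) * cost.shape[0]


# Toy batch: pairs 0 and 1 are correct, captions 2 and 3 are swapped (noisy).
images = np.eye(4)
captions = np.eye(4)[[0, 1, 3, 2]]
confidence = pair_confidence(images, captions)
```

In this toy batch the plan routes the mass of samples 2 and 3 to each other's captions, so their nominal pairs score far lower than pairs 0 and 1. A training loop could use such scores to down-weight suspect pairs; how DCOT combines the cross-lingual and cross-modal views and schedules the costs over training is described in the full paper, not here.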

research
08/26/2022

Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Despite the recent developments in the field of cross-modal retrieval, t...
research
06/01/2022

Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

In this paper, we introduce Cross-View Language Modeling, a simple and e...
research
10/07/2022

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Multilingual text-video retrieval methods have improved significantly in...
research
05/20/2023

Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment

Unpaired cross-lingual image captioning has long suffered from irrelevan...
research
05/13/2023

RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training

Multilingual vision-language (V&L) pre-training has achieved remarkabl...
research
10/03/2020

Unsupervised Cross-lingual Image Captioning

Most recent image captioning works are conducted in English as the major...
research
04/13/2023

Noisy Correspondence Learning with Meta Similarity Correction

Despite the success of multimodal learning in cross-modal retrieval task...
