Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

05/08/2023
by Chaoya Jiang, et al.

Cross-modal contrastive learning in vision language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is well known that the InfoNCE loss used in contrastive learning maximizes a lower bound of the MI between anchors and their positives; we theoretically prove that the MI involving negatives also matters when noise is common. Guided by a more general form of the lower bound for optimization, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which optimizes the MI between an image/text anchor and its negative texts/images more accurately instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
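As a point of reference for the InfoNCE loss mentioned above, the following is a minimal NumPy sketch of the standard image-to-text form, together with the usual bound MI >= log(N) - loss for a batch of N pairs. It is not the regulated loss proposed in the paper; the function names `info_nce` and `mi_lower_bound`, the temperature value, and the random embeddings are all assumptions chosen for illustration.

```python
import numpy as np

def info_nce(image_emb, text_emb, temperature=0.07):
    """Image-to-text InfoNCE over a batch of paired embeddings.

    Row i of image_emb and row i of text_emb form the positive pair;
    every other text in the batch is treated as a negative.
    (Hypothetical helper, not taken from the paper.)
    """
    # L2-normalise so that the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy against one-hot targets on the diagonal.
    return -np.mean(np.diag(log_prob))

def mi_lower_bound(loss, batch_size):
    """Standard InfoNCE bound: MI >= log(N) - loss."""
    return np.log(batch_size) - loss

# Toy batch of 8 random image/text embeddings (illustrative data only).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = rng.normal(size=(8, 16))
loss = info_nce(img, txt)
bound = mi_lower_bound(loss, batch_size=8)
```

In this plain form every off-diagonal pair is pushed apart with the same hard zero target, including texts that partially match the anchor image. That is the behaviour the abstract identifies as problematic in the presence of (partial) false negatives.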


research
05/09/2023

Exploiting Pseudo Image Captions for Multimodal Summarization

Cross-modal contrastive learning in vision language pretraining (VLP) fa...
research
05/08/2022

Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework

Vision-language pre-training (VLP) relying on large-scale pre-training d...
research
11/23/2022

How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning?

Various state-of-the-art self-supervised visual representation learning ...
research
08/19/2023

An Empirical Study of CLIP for Text-based Person Search

Text-based Person Search (TBPS) aims to retrieve the person images using...
research
04/28/2022

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

We present an approach to learn voice-face representations from the talk...
research
02/23/2023

Learning Visual Representations via Language-Guided Sampling

Although an object may appear in numerous contexts, we often describe it...
research
06/15/2022

Discrete Contrastive Diffusion for Cross-Modal and Conditional Generation

Diffusion probabilistic models (DPMs) have become a popular approach to ...
