On the Generalization of Multi-modal Contrastive Learning

06/07/2023
by   Qi Zhang, et al.

Multi-modal contrastive learning (MMCL) has recently garnered considerable interest due to its superior performance in visual tasks, achieved by embedding multi-modal data such as visual-language pairs. However, there is still a lack of theoretical understanding of how MMCL extracts useful visual representations from multi-modal pairs, and in particular, of how MMCL outperforms previous approaches such as self-supervised contrastive learning (SSCL). In this paper, by drawing an intrinsic connection between MMCL and asymmetric matrix factorization, we establish the first generalization guarantees of MMCL for visual downstream tasks. Based on this framework, we further unify MMCL and SSCL by showing that MMCL implicitly performs SSCL with (pseudo) positive pairs induced by text pairs. Through this unified perspective, we characterize the advantage of MMCL by showing that text pairs induce more semantically consistent and diverse positive pairs, which, according to our analysis, provably benefit downstream generalization. Inspired by this finding, we propose CLIP-guided resampling methods that significantly improve the downstream performance of SSCL on ImageNet by leveraging multi-modal information. Code is available at https://github.com/PKU-ML/CLIP-Help-SimCLR.
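The abstract only describes CLIP-guided resampling at a high level; for the exact procedure, see the linked repository. As a rough, hedged illustration of the idea, the sketch below shows one plausible way frozen CLIP image embeddings could be used to draw semantically similar pseudo-positive partners for an SSCL method such as SimCLR. The function name, the temperature parameter, and the softmax-sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_guided_positive_resampling(clip_image_embs, batch_indices, temperature=0.07):
    """Hypothetical sketch: sample a pseudo-positive partner for each anchor image
    according to CLIP image-embedding similarity, so that SimCLR-style SSCL sees
    semantically consistent positive pairs rather than only two augmentations of
    the same image.

    clip_image_embs: (N, d) frozen CLIP image embeddings for the whole dataset.
    batch_indices:   (B,) dataset indices of the current batch of anchors.
    Returns:         (B,) dataset indices of the sampled positive partners.
    """
    embs = F.normalize(clip_image_embs, dim=-1)        # unit-norm embeddings
    anchors = embs[batch_indices]                      # (B, d)
    sims = anchors @ embs.T                            # (B, N) cosine similarities
    # Exclude each anchor itself so the positive is a different image.
    sims[torch.arange(len(batch_indices)), batch_indices] = float("-inf")
    # Lower temperature concentrates sampling on the nearest CLIP neighbours.
    probs = F.softmax(sims / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```

Under this reading, an augmented view of the sampled partner would replace the second augmented view of the anchor in the standard InfoNCE loss, which is one way to realize the "more semantically consistent and diverse positive pairs" that the paper attributes to text supervision.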


