CyCLIP: Cyclic Contrastive Language-Image Pretraining

05/28/2022
by Shashank Goel, et al.

Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness. Such models typically require joint reasoning in the image and text representation spaces for downstream inference tasks. Contrary to prior beliefs, we demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions. To mitigate this issue, we formalize consistency and propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes for the learned representations to be geometrically consistent in the image and text space. In particular, we show that consistent representations can be learned by explicitly symmetrizing (a) the similarity between the two mismatched image-text pairs (cross-modal consistency); and (b) the similarity between the image-image pair and the text-text pair (in-modal consistency). Empirically, we show that the improved consistency in CyCLIP translates to significant gains over CLIP, with gains of 10% or more in zero-shot classification accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and of 10% or more in robustness to various natural distribution shifts. The code is available at https://github.com/goel-shashank/CyCLIP.
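The two symmetrization terms described in the abstract can be sketched directly from their definitions. Below is a minimal, illustrative NumPy sketch (function and variable names are ours, not from the paper's codebase): cross-modal consistency penalizes asymmetry between the similarity of image i with text j and image j with text i, while in-modal consistency penalizes mismatch between image-image and text-text similarities. In the paper, these regularizers are added, with weighting coefficients, to the standard CLIP contrastive loss.

```python
import numpy as np

def cyclic_consistency_losses(img_emb, txt_emb):
    """Illustrative sketch of CyCLIP's two consistency regularizers.

    img_emb, txt_emb: (N, d) arrays of paired image/text embeddings.
    Returns (cross_modal_loss, in_modal_loss), both squared-error means.
    """
    # Normalize embeddings to unit length, as in CLIP-style models.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # Cross-modal consistency: sim(I_i, T_j) should equal sim(I_j, T_i),
    # i.e. the image-text similarity matrix should be symmetric.
    logits = img @ txt.T
    cross_modal = np.mean((logits - logits.T) ** 2)

    # In-modal consistency: sim(I_i, I_j) should equal sim(T_i, T_j),
    # i.e. the two in-modal similarity matrices should agree.
    in_modal = np.mean((img @ img.T - txt @ txt.T) ** 2)

    return cross_modal, in_modal
```

When the two encoders produce identical (interchangeable) embeddings, both terms are zero; any geometric inconsistency between the modalities makes them positive, which is what the regularizers drive down during pretraining.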

Related research

- Finetune like you pretrain: Improved finetuning of zero-shot vision models (12/01/2022). Finetuning image-text models such as CLIP achieves state-of-the-art accu...
- Prototypical Contrastive Language Image Pretraining (06/22/2022). Contrastive Language Image Pretraining (CLIP) received widespread attent...
- Learning Visual Representations via Language-Guided Sampling (02/23/2023). Although an object may appear in numerous contexts, we often describe it...
- Unified Contrastive Learning in Image-Text-Label Space (04/07/2022). Visual recognition is recently learned via either supervised learning on...
- Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining (02/05/2023). Mainstream 3D representation learning approaches are built upon contrast...
- Unsupervised Improvement of Audio-Text Cross-Modal Representations (05/03/2023). Recent advances in using language models to obtain cross-modal audio-tex...
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation (04/10/2022). The learning objective of vision-language approach of CLIP does not effe...
