Cross-Modal Concept Learning and Inference for Vision-Language Models

07/28/2023
by Yi Zhang, et al.

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole-image matching is not effective, since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper we develop a new method called cross-modal concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method improves upon the current state-of-the-art methods by large margins, for example, by up to 8.0% on few-shot learning and by up to 1.3% on domain generalization.
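To make the core idea concrete, the sketch below shows the kind of text-image concept scoring such a method can build on: an image is scored against a small vocabulary of semantic concept phrases using CLIP's shared embedding space. This is a minimal illustration, not the authors' implementation; the concept phrases, model variant ("ViT-B/32"), and file name "example.jpg" are assumptions made for the example.

```python
# Minimal sketch: concept activations from CLIP (illustrative, not CCLI itself).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical vocabulary of semantic part/concept phrases (not class names).
concepts = ["a furry tail", "a metal wheel", "a feathered wing", "a wooden handle"]
text_tokens = clip.tokenize(concepts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    text_feats = model.encode_text(text_tokens)
    image_feats = model.encode_image(image)
    # Normalize so the dot product is a cosine similarity.
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    # One score per concept: how strongly the image expresses that concept.
    concept_scores = (image_feats @ text_feats.T).squeeze(0)

for name, score in zip(concepts, concept_scores.tolist()):
    print(f"{name}: {score:.3f}")
```

A vector of such concept scores could then serve as a discriminative, concept-based image representation and be fed to a small downstream classifier, which is the role the concept inference network plays in the paper.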

