Cross-Modal Concept Learning and Inference for Vision-Language Models

by   Yi Zhang, et al.
Southern University of Science & Technology

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-model concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method is able to improve the performance upon the current state-of-the-art methods by large margins, for example, by up to 8.0 learning and by up to 1.3


page 4

page 5


Unsupervised Prototype Adapter for Vision-Language Models

Recently, large-scale pre-trained vision-language models (e.g. CLIP and ...

BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP and ...

Modeling Images using Transformed Indian Buffet Processes

Latent feature models are attractive for image modeling, since images ge...

DisCLIP: Open-Vocabulary Referring Expression Generation

Referring Expressions Generation (REG) aims to produce textual descripti...

Finding Experts in Transformer Models

In this work we study the presence of expert units in pre-trained Transf...

Contextual Prompt Learning for Vision-Language Understanding

Recent advances in multimodal learning has resulted in powerful vision-l...

Variational prompt tuning improves generalization of vision-language models

Prompt tuning provides an efficient mechanism to adapt large vision-lang...

Please sign up or login with your details

Forgot password? Click here to reset