CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels

11/25/2022
by Siyuan Li, et al.

Pre-trained vision-language models like CLIP have recently shown superior performance on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID), the labels are indexes and lack concrete text descriptions, so it remains unclear how such models can be applied to these tasks. This paper first finds that simply fine-tuning the visual model initialized with CLIP's image encoder already achieves competitive performance on various ReID tasks. We then propose a two-stage strategy to facilitate a better visual representation. The key idea is to fully exploit the cross-modal description ability of CLIP through a set of learnable text tokens for each ID, which are fed to the text encoder to form ambiguous descriptions. In the first training stage, the image and text encoders from CLIP remain fixed, and only the text tokens are optimized from scratch with a contrastive loss computed within each batch. In the second stage, the ID-specific text tokens and their encoder are frozen and provide constraints for fine-tuning the image encoder. With the help of the losses designed for the downstream task, the image encoder learns to represent images accurately as vectors in the feature embedding space. The effectiveness of the proposed strategy is validated on several datasets for person and vehicle ReID tasks. Code is available at https://github.com/Syliz517/CLIP-ReID.
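The two-stage recipe described above translates naturally into a short training loop. Below is a minimal PyTorch-style sketch, not the authors' released code: it assumes a CLIP-like model exposing an image encoder and a hypothetical encode_text_from_tokens hook that accepts learnable token embeddings; PromptLearner, id_head, and the hyperparameters are illustrative placeholders.

```python
# Sketch of the two-stage CLIP-ReID training strategy (assumptions noted inline).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """One set of learnable context tokens per identity (ID)."""
    def __init__(self, num_ids, num_tokens=4, embed_dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ids, num_tokens, embed_dim) * 0.02)

    def forward(self, ids):
        # ids: (B,) integer identity labels -> (B, num_tokens, embed_dim)
        return self.ctx[ids]

def contrastive_loss(img_feat, txt_feat, temperature=0.07):
    """Symmetric image-text contrastive loss computed within a batch."""
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    targets = torch.arange(img_feat.size(0), device=img_feat.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def train_stage1(clip_model, prompt_learner, loader, epochs=1, lr=3.5e-4):
    # Stage 1: both CLIP encoders stay frozen; only the ID-specific tokens learn.
    for p in clip_model.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(prompt_learner.parameters(), lr=lr)
    for _ in range(epochs):
        for images, ids in loader:
            img_feat = clip_model.encode_image(images)                 # frozen
            txt_feat = clip_model.encode_text_from_tokens(             # hypothetical hook
                prompt_learner(ids))
            loss = contrastive_loss(img_feat, txt_feat)
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_stage2(clip_model, prompt_learner, id_head, loader, epochs=1, lr=5e-6):
    # Stage 2: text tokens and text encoder are frozen; the image encoder is
    # fine-tuned with an ID classification loss plus the image-to-text constraint.
    for p in prompt_learner.parameters():
        p.requires_grad_(False)
    for p in clip_model.visual.parameters():
        p.requires_grad_(True)
    opt = torch.optim.Adam(
        list(clip_model.visual.parameters()) + list(id_head.parameters()), lr=lr)
    for _ in range(epochs):
        for images, ids in loader:
            img_feat = clip_model.encode_image(images)
            with torch.no_grad():
                txt_feat = clip_model.encode_text_from_tokens(prompt_learner(ids))
            loss = F.cross_entropy(id_head(img_feat), ids) + \
                   contrastive_loss(img_feat, txt_feat)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Here id_head would be a simple classifier (e.g., nn.Linear) over the training identities; in practice ReID pipelines also add a triplet loss in stage 2, which is omitted here for brevity.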


