Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

by Shuhuai Ren et al.

This work proposes POMP, a prompt pre-training method for vision-language models. Memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the highly transferable prompt can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 downstream datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg).
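To make the "plug the pre-trained prompt into a recognition task" idea concrete, here is a minimal toy sketch of prompt-conditioned zero-shot classification in the CLIP style. It is not the authors' POMP implementation: the shared context vectors, the mean-pool "text encoder", and all names and dimensions are illustrative assumptions standing in for a real tokenizer, text encoder, and image encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension

# Hypothetical pre-trained prompt: M learned context vectors shared across
# all classes, standing in for the prompt POMP would pre-train.
M = 4
prompt = rng.normal(size=(M, D))

def encode_class(name_embedding, prompt):
    """Toy 'text encoder': pool the prompt vectors with the class-name
    embedding, mimicking how a learned prompt conditions each class's
    text feature. A real system would run [prompt; name] through a
    transformer text encoder instead."""
    tokens = np.vstack([prompt, name_embedding])
    feat = tokens.mean(axis=0)
    return feat / np.linalg.norm(feat)

# Toy class-name embeddings for an open vocabulary (in practice these come
# from tokenizing the class names).
class_names = ["cat", "dog", "car"]
name_embs = {c: rng.normal(size=D) for c in class_names}
text_feats = np.stack([encode_class(name_embs[c], prompt) for c in class_names])

def zero_shot_classify(image_feat, text_feats):
    """Zero-shot prediction: cosine similarity between the image feature
    and each prompt-conditioned class feature, argmax over classes."""
    image_feat = image_feat / np.linalg.norm(image_feat)
    sims = text_feats @ image_feat
    return int(np.argmax(sims))

# An image feature close to the "dog" class feature should be labeled "dog".
image_feat = text_feats[1] + 0.05 * rng.normal(size=D)
print(class_names[zero_shot_classify(image_feat, text_feats)])
```

Because the prompt is shared across classes and only the class-name embedding changes, the same pre-trained prompt extends to arbitrary vocabularies, which is what lets it transfer zero-shot to classification, segmentation (per-pixel features), and detection (per-region features).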




A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model

Recently, zero-shot image classification by vision-language pre-training...

Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

We study multimodal few-shot object detection (FSOD) in this paper, usin...

Supporting Vision-Language Model Inference with Causality-pruning Knowledge Prompt

Vision-language models are pre-trained by aligning image-text pairs in a...

Vision-Language Models for Vision Tasks: A Survey

Most visual recognition studies rely heavily on crowd-labelled data in d...

OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training

Advancing object detection to open-vocabulary and few-shot transfer has ...

Rethinking the Openness of CLIP

Contrastive Language-Image Pre-training (CLIP) has demonstrated great po...

Opening Deep Neural Networks with Generative Models

Image classification methods are usually trained to perform predictions ...
