Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting

by Guangxing Han, et al.

We study multimodal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection. Most previous work focuses on either few-shot or zero-shot object detection and ignores the complementarity of visual and semantic information. We first show that meta-learning and prompt-based learning, the most commonly used methods for few-shot learning and for zero-shot transfer from pre-trained vision-language models to downstream tasks, respectively, are conceptually similar. Both reformulate the objective of the downstream task to match that of the pre-training task, and both mostly avoid tuning the parameters of the pre-trained model. Based on this observation, we propose to combine meta-learning with prompt-based learning for multimodal FSOD without fine-tuning, by learning transferable class-agnostic multimodal FSOD models over many-shot base classes. Specifically, to better exploit pre-trained vision-language models, we propose meta-learning-based cross-modal prompting, which generates soft prompts conditioned on the few-shot visual examples; these prompts are then used to extract the semantic prototype. The extracted semantic prototype and the few-shot visual prototype are then fused to generate the multimodal prototype used for detection. Our models can efficiently fuse visual and semantic information at both the token level and the feature level. We comprehensively evaluate the proposed multimodal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
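The data flow described in the abstract (visual prototype, soft prompt conditioned on it, semantic prototype, fused multimodal prototype) can be illustrated with a toy sketch. This is not the authors' implementation: the prompt generator, the stand-in for the frozen text encoder, the fusion weight `alpha`, and every function name below are hypothetical placeholders chosen only to make the pipeline concrete, and plain Python lists stand in for learned features.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def generate_soft_prompt(visual_proto, num_tokens=2):
    # ASSUMPTION: a fixed scaling stands in for a meta-learned prompt
    # generator that maps the visual prototype to soft prompt tokens.
    return [[(t + 1) * 0.1 * v for v in visual_proto] for t in range(num_tokens)]


def semantic_prototype(soft_prompt, class_embedding):
    # ASSUMPTION: averaging the prompt tokens with the class-name embedding
    # stands in for a frozen pre-trained text encoder.
    return mean_vector(soft_prompt + [class_embedding])


def multimodal_prototype(support_features, class_embedding, alpha=0.5):
    # Visual prototype: mean of the few-shot support features.
    visual = mean_vector(support_features)
    # Soft prompt conditioned on the visual examples, then semantic prototype.
    semantic = semantic_prototype(generate_soft_prompt(visual), class_embedding)
    # ASSUMPTION: a simple convex combination stands in for the fusion step.
    return [alpha * v + (1 - alpha) * s for v, s in zip(visual, semantic)]


def classify(query_feature, prototypes):
    """Return the class whose prototype is most similar to the query."""
    return max(prototypes, key=lambda name: cosine(query_feature, prototypes[name]))


prototypes = {
    "cat": multimodal_prototype([[1.0, 0.0], [0.8, 0.2]], [1.0, 0.0]),
    "dog": multimodal_prototype([[0.0, 1.0], [0.2, 0.8]], [0.0, 1.0]),
}
print(classify([0.9, 0.1], prototypes))
```

In this sketch no parameters are updated when a new class arrives: a prototype is computed from the support examples and the class embedding alone, which mirrors the "without fine-tuning" setting the abstract describes.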




CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabi...

Learning a Better Initialization for Soft Prompts via Meta-Learning

Prompt tuning (PT) is an effective approach to adapting pre-trained lang...

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

This work proposes POMP, a prompt pre-training method for vision-languag...

Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Multimodal few-shot learning is challenging due to the large domain gap ...

Does language help generalization in vision models?

Vision models trained on multimodal datasets have recently proved very e...

Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models

Prompt learning is a new learning paradigm which reformulates downstream...

OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training

Advancing object detection to open-vocabulary and few-shot transfer has ...
