GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning

03/16/2023
by Jiayi Lin, et al.

A vision-language foundation model pretrained on very large-scale image-text paired data can provide generalizable knowledge representations for downstream visual recognition and detection tasks, especially for supplementing undersampled categories in downstream model training. Recent studies applying CLIP to object detection show that a two-stage detector design typically outperforms a one-stage detector, but at the cost of more expensive training and longer inference. In this work, we propose GridCLIP, a one-stage detector that narrows the performance gap to two-stage detectors while being approximately 43 times faster in training and 5 times faster at test time than its two-stage counterpart (ViLD). GridCLIP learns grid-level representations suited to the intrinsic principle of one-stage detection by expanding CLIP's conventional image-text holistic mapping to a more fine-grained grid-text alignment. This differs from the region-text mapping in two-stage detectors, which applies CLIP directly by treating regions as images. Specifically, GridCLIP performs Grid-level Alignment, adapting CLIP image-level representations to grid-level representations by aligning them to CLIP category (text) representations, so as to learn the annotated (especially frequent) categories. To learn generalizable visual representations of broader categories, especially undersampled ones, GridCLIP additionally performs Image-level Alignment during training, propagating the broad range of categories pre-learned by the CLIP image encoder from image-level to grid-level representations. Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories, reaching detection performance on the LVIS benchmark comparable to that of two-stage detectors.
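Below is a minimal, hypothetical sketch of the two alignment objectives described in the abstract, written in PyTorch. It is not the paper's implementation: the function names (grid_text_alignment_loss, image_level_alignment_loss), the mean-pooling of grid features, the temperature value, and the cosine-distance distillation loss are illustrative assumptions; GridCLIP's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def grid_text_alignment_loss(grid_feats, text_embeds, grid_labels, tau=0.01):
    """Hypothetical grid-level alignment: classify each grid cell by the cosine
    similarity between its projected feature and CLIP category text embeddings.

    grid_feats:  (N, D) grid-cell features already projected to the CLIP embedding dim
    text_embeds: (C, D) CLIP text embeddings, one per category prompt
    grid_labels: (N,)   ground-truth category index assigned to each grid cell
    """
    grid_feats = F.normalize(grid_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = grid_feats @ text_embeds.t() / tau  # temperature-scaled cosine similarities
    return F.cross_entropy(logits, grid_labels)

def image_level_alignment_loss(grid_feats, clip_image_embed):
    """Hypothetical image-level alignment: distil the frozen CLIP image encoder's
    global embedding onto the pooled grid-level representation of the same image.

    grid_feats:       (N, D) grid-cell features for one image
    clip_image_embed: (D,)   frozen CLIP image-level embedding of the same image
    """
    pooled = F.normalize(grid_feats.mean(dim=0), dim=-1)
    target = F.normalize(clip_image_embed, dim=-1)
    return 1.0 - (pooled * target).sum()  # cosine-distance distillation term

# Toy usage with random tensors (shapes only, for illustration)
N, C, D = 64, 20, 512
loss = (grid_text_alignment_loss(torch.randn(N, D), torch.randn(C, D),
                                 torch.randint(0, C, (N,)))
        + image_level_alignment_loss(torch.randn(N, D), torch.randn(D)))
```

In practice the grid features would come from the one-stage detector's feature maps and the text embeddings from CLIP's text encoder applied to category prompts; the random tensors above stand in only to show the expected shapes.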


