Aligning Bag of Regions for Open-Vocabulary Object Detection

02/27/2023
by Size Wu, et al.

Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to go beyond individual regions and align the embedding of a bag of regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of the regions in a bag are treated as the embeddings of words in a sentence and are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is trained to align with the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on the novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.
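The bag-of-regions idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). It rests on several assumptions: the text encoder is replaced by a hypothetical mean-pooling stand-in (`toy_text_encoder`), the frozen VLM features are passed in as plain vectors, and the alignment objective is assumed to be a contrastive (InfoNCE-style) loss, which the abstract does not specify. All function names and the `temperature` parameter are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def toy_text_encoder(pseudo_words):
    """Hypothetical stand-in for the VLM text encoder.

    Region embeddings in a bag are treated like word embeddings in a
    sentence; here the "sentence" is simply mean-pooled and normalized.
    """
    dim = len(pseudo_words[0])
    pooled = [sum(w[i] for w in pseudo_words) / len(pseudo_words) for i in range(dim)]
    return l2_normalize(pooled)


def bag_alignment_loss(student_bags, teacher_embeddings, temperature=0.1):
    """Assumed contrastive loss: bag i should match teacher embedding i."""
    student = [toy_text_encoder(bag) for bag in student_bags]
    teacher = [l2_normalize(t) for t in teacher_embeddings]
    total = 0.0
    for i, s in enumerate(student):
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in teacher]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(student)


# Two bags, each holding two made-up 3-d region embeddings.
bags = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]],
]
# Made-up "frozen VLM" features for the image areas covering each bag.
teacher_aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_swapped = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

aligned_loss = bag_alignment_loss(bags, teacher_aligned)
swapped_loss = bag_alignment_loss(bags, teacher_swapped)
print(aligned_loss < swapped_loss)
```

In this toy setting the loss is lower when each bag embedding is paired with its own teacher feature than when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to encourage.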


