Enhancing the Role of Context in Region-Word Alignment for Object Detection

03/17/2023
by Kyle Buettner, et al.

Vision-language pretraining that learns a fine-grained region-word alignment between image-caption pairs has propelled progress in open-vocabulary object detection. We observe that region-word alignment methods are typically applied in detection with respect to object nouns only, and that the impact of other rich context in captions, such as attributes, is unclear. In this study, we explore how language context affects downstream object detection and propose to enhance the role of context. In particular, we show how to strategically contextualize the grounding pretraining objective for improved alignment. We further home in on attributes as especially useful object context and propose a novel adjective- and noun-based negative sampling strategy that increases their focus in contrastive learning. Overall, our methods improve object detection compared with the state of the art in region-word pretraining. We also highlight the fine-grained utility of an attribute-sensitive model through text-region retrieval and phrase grounding analysis.
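The abstract does not spell out how the adjective- and noun-based negatives are constructed, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes negatives are made by swapping a single adjective or noun in a caption for a different word of the same type. The word lists `ADJECTIVES` and `NOUNS` and the function `sample_negatives` are hypothetical names introduced here for illustration.

```python
import random

# Hypothetical vocabularies (assumption): the paper's actual word lists
# are not given on this page.
ADJECTIVES = ["red", "blue", "green", "wooden", "small", "large"]
NOUNS = ["car", "dog", "table", "cup", "bicycle"]


def sample_negatives(caption, num_negatives=3, seed=0):
    """Return negative captions that differ from `caption` in exactly one
    adjective or noun. Illustrative only; not the authors' implementation."""
    rng = random.Random(seed)
    tokens = caption.lower().split()
    candidates = []
    for i, token in enumerate(tokens):
        for vocab in (ADJECTIVES, NOUNS):
            if token not in vocab:
                continue
            # Swap this token for every other word of the same type.
            for replacement in vocab:
                if replacement != token:
                    swapped = tokens[:i] + [replacement] + tokens[i + 1:]
                    candidates.append(" ".join(swapped))
    # Sample without replacement; fewer are returned if few candidates exist.
    return rng.sample(candidates, min(num_negatives, len(candidates)))


if __name__ == "__main__":
    for negative in sample_negatives("a red car"):
        print(negative)
```

In a contrastive setup, captions like these would serve as hard negatives for a region that matches the original caption, so the model is pushed to distinguish attributes (for example "red" versus "blue") as well as object categories.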

research
12/16/2021

RegionCLIP: Region-based Language-Image Pretraining

Contrastive language-image pretraining (CLIP) using image-text pairs has...
research
04/10/2023

DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment

This paper presents DetCLIPv2, an efficient and scalable training framew...
research
12/09/2022

Contrastive View Design Strategies to Enhance Robustness to Domain Shifts in Downstream Object Detection

Contrastive learning has emerged as a competitive pretraining method for...
research
05/11/2023

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a...
research
11/02/2022

P^3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Inspired by the success of visual-language methods (VLMs) in zero-shot c...
research
09/09/2016

The Role of Context Selection in Object Detection

We investigate the reasons why context in object detection has limited u...
research
05/23/2023

Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

Recent work in vision-and-language pretraining has investigated supervis...
