Learning Object Detection from Captions via Textual Scene Attributes

09/30/2020
by Achiya Jerbi, et al.

Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source of weak supervision, but to date, in the context of object detection, captions have only been used to infer the categories of the objects in the image. In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations. Namely, the text can be read as a scene graph of the image, as described recently in the literature. We present a method that uses the attributes in this "textual scene graph" to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets, outperforming recent approaches.

research
11/25/2018

Learning to discover and localize visual objects with open vocabulary

To alleviate the cost of obtaining accurate bounding boxes for training ...
research
05/28/2021

Linguistic Structures as Weak Supervision for Visual Scene Graph Generation

Prior work in scene graph generation requires categorical supervision at...
research
07/23/2019

Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection

Learning to localize and name object instances is a fundamental problem ...
research
02/22/2023

The Digitization of Historical Astrophysical Literature with Highly-Localized Figures and Figure Captions

Scientific articles published prior to the "age of digitization" in the ...
research
02/01/2021

Inferring spatial relations from textual descriptions of images

Generating an image from its textual description requires both a certain...
research
09/13/2022

ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers

Annotating bounding boxes for object detection is expensive, time-consum...
research
06/03/2021

GMAIR: Unsupervised Object Detection Based on Spatial Attention and Gaussian Mixture

Recent studies on unsupervised object detection based on spatial attenti...
