Learning to Predict the 3D Layout of a Scene

by Jihao Andreas Lin, et al.

While 2D object detection has improved significantly over the past years, real-world applications of computer vision often require an understanding of the 3D layout of a scene. Many recent approaches to 3D detection use LiDAR point clouds for prediction. We propose a method that uses only a single RGB image, thus enabling applications in devices or vehicles that do not have LiDAR sensors. By using an RGB image, we can leverage the maturity and success of recent 2D object detectors by extending a 2D detector with a 3D detection head. In this paper we discuss different approaches and experiments for designing this 3D detection head, including both regression and classification methods. Furthermore, we evaluate how subproblems and implementation details impact the overall prediction result. We use the KITTI dataset for training, which consists of street traffic scenes with class labels, 2D bounding boxes and 3D annotations with seven degrees of freedom. Our final architecture is based on Faster R-CNN. The outputs of the convolutional backbone are fixed-size feature maps for every region of interest. Fully connected layers within the network head then propose an object class and perform 2D bounding box regression. We extend the network head with a 3D detection head, which predicts every degree of freedom of a 3D bounding box via classification. We achieve a mean average precision of 47.3% at an intersection over union threshold of 70% on the KITTI benchmark, outperforming previous state-of-the-art methods that use only a single RGB image by a large margin.
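To make "predicts every degree of freedom via classification" concrete, the following is a minimal sketch of one common way to turn a continuous quantity into a classification target: split its range into equal-width bins, train on the bin index, and decode the highest-scoring bin back to its center. The abstract does not specify the binning scheme, so the bin count, the value range, the choice of a yaw angle as the example, and the helper names `make_bins`, `encode` and `decode` are all assumptions made for illustration, not the authors' implementation.

```python
import math

def make_bins(lo, hi, n_bins):
    """Return the centers of n_bins equal-width bins covering [lo, hi)."""
    width = (hi - lo) / n_bins
    return [lo + (i + 0.5) * width for i in range(n_bins)]

def encode(value, lo, hi, n_bins):
    """Map a continuous value to the index of the bin that contains it."""
    width = (hi - lo) / n_bins
    idx = int((value - lo) // width)
    return min(max(idx, 0), n_bins - 1)  # clamp out-of-range values

def decode(logits, centers):
    """Pick the highest-scoring bin and return its center as the prediction."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return centers[best]

# Assumed setup: one rotation angle in [-pi, pi) split into 8 classes.
N_BINS = 8
centers = make_bins(-math.pi, math.pi, N_BINS)

# Training side: the continuous label becomes a class index.
target_class = encode(0.3, -math.pi, math.pi, N_BINS)

# Inference side: stand-in scores from a hypothetical classification head.
logits = [0.0] * N_BINS
logits[target_class] = 5.0
predicted = decode(logits, centers)
```

With this scheme the decoded value can differ from the true value by up to half a bin width, so the number of bins trades classification difficulty against quantization error. A full head would hold one such classifier per degree of freedom.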



