Time Will Tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection

by Jinhyung Park, et al.

While recent camera-only 3D detection methods leverage multiple timesteps, the limited history they use significantly hampers the extent to which temporal fusion can improve object perception. Observing that existing works' fusion of multi-frame images is an instance of temporal stereo matching, we find that performance is hindered by the interplay between 1) the low granularity of matching resolution and 2) the sub-optimal multi-view setup produced by limited history usage. Our theoretical and empirical analysis demonstrates that the optimal temporal difference between views varies significantly for different pixels and depths, making it necessary to fuse many timesteps over long-term history. Building on our investigation, we propose to generate a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup. Further, we augment the per-frame monocular depth predictions used for long-term, coarse matching with short-term, fine-grained matching and find that long-term and short-term temporal fusion are highly complementary. While maintaining high efficiency, our framework sets a new state of the art on nuScenes, achieving first place on the test set and outperforming the previous best method by 5.2% mAP and 3.7% NDS on the validation set. Code will be released at https://github.com/Divadi/SOLOFusion.
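To make the claim about depth-dependent temporal differences concrete, here is a minimal sketch under strong simplifying assumptions: a rectified two-view pair with purely lateral camera motion, where disparity follows the textbook relation d = f * B / z and the baseline B grows linearly with the frame gap. This is not the paper's analysis or code; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import Optional


def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Pixel disparity of a point at depth_m for a rectified pair: d = f * B / z."""
    return focal_px * baseline_m / depth_m


def min_gap_frames(
    focal_px: float,
    speed_mps: float,
    frame_dt_s: float,
    depth_m: float,
    depth_step_m: float,
    stride_px: float,
    max_gap: int = 100,
) -> Optional[int]:
    """Smallest frame gap at which depth_m and depth_m + depth_step_m
    project at least one feature cell (stride_px pixels) apart.

    Assumes lateral ego-motion at constant speed, so the baseline for a gap
    of k frames is speed_mps * frame_dt_s * k. Returns None if no gap up to
    max_gap separates the two depth hypotheses.
    """
    for gap in range(1, max_gap + 1):
        baseline_m = speed_mps * frame_dt_s * gap
        near = disparity_px(focal_px, baseline_m, depth_m)
        far = disparity_px(focal_px, baseline_m, depth_m + depth_step_m)
        if near - far >= stride_px:
            return gap
    return None


if __name__ == "__main__":
    # Hypothetical setup: 1000 px focal length, 10 m/s, 2 Hz keyframes,
    # features at 1/16 image resolution, 1 m depth hypotheses.
    for depth in (20.0, 40.0):
        gap = min_gap_frames(1000.0, 10.0, 0.5, depth, 1.0, 16.0)
        print(f"depth {depth:.0f} m -> frame gap {gap}")
```

In this toy model, a farther point needs a longer frame gap before neighbouring depth hypotheses land in different coarse feature cells, which is one intuition for why a single fixed short history is a poor fit for all pixels and depths.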



Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

In this paper, we propose a long-sequence modeling framework, named Stre...

MV-FCOS3D++: Multi-View Camera-Only 4D Object Detection with Pretrained Monocular Backbones

In this technical report, we present our solution, dubbed MV-FCOS3D++, f...

Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Sparse algorithms offer great flexibility for multi-view temporal percep...

Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction

In this paper, we propose a new paradigm, named Historical Object Predic...

MRS-VPR: a multi-resolution sampling based global visual place recognition method

Place recognition and loop closure detection are challenging for long-te...

STS: Surround-view Temporal Stereo for Multi-view 3D Detection

Learning accurate depth is essential to multi-view 3D object detection. ...

Long-term, Short-term and Sudden Event: Trading Volume Movement Prediction with Graph-based Multi-view Modeling

Trading volume movement prediction is the key in a variety of financial ...
