ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

by Shariq F. Bhat, et al.

This paper tackles the problem of depth estimation from a single image. Existing work either focuses on generalization performance while disregarding metric scale, i.e., relative depth estimation, or on state-of-the-art results on specific datasets, i.e., metric depth estimation. We propose the first approach that combines both worlds, leading to a model with excellent generalization performance while maintaining metric scale. Our flagship model, ZoeD-M12-NK, is pre-trained on 12 datasets using relative depth and fine-tuned on two datasets using metric depth. We use a lightweight head with a novel bin adjustment design, called the metric bins module, for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier. Our framework admits multiple configurations depending on the datasets used for relative depth pre-training and metric fine-tuning. Without pre-training, we can already significantly improve the state of the art (SOTA) on the NYU Depth v2 indoor dataset. Pre-training on twelve datasets and fine-tuning on the NYU Depth v2 indoor dataset, we can further improve SOTA for a total of 21% in terms of relative absolute error (REL). Finally, ZoeD-M12-NK is the first model that can jointly train on multiple datasets (NYU Depth v2 and KITTI) without a significant drop in performance and achieve unprecedented zero-shot generalization performance to eight unseen datasets from both indoor and outdoor domains. The code and pre-trained models are publicly available at https://github.com/isl-org/ZoeDepth.
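The routing step described in the abstract, where a latent classifier sends each image to a domain-specific metric head, can be illustrated with a minimal sketch. This is not the ZoeDepth implementation: the names `route_to_head`, `toy_classifier`, `toy_heads`, and `DOMAINS`, the two-element latent vector, and the depth caps of 10 and 80 are all hypothetical stand-ins chosen only to make the control flow concrete.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_to_head(
    latent: Sequence[float],
    classifier: Callable[[Sequence[float]], Sequence[float]],
    heads: Dict[str, Callable[[Sequence[float]], List[float]]],
    domains: Sequence[str],
) -> Tuple[str, List[float]]:
    """Pick the most probable domain for `latent` and apply that domain's head."""
    probs = softmax(classifier(latent))
    best = max(range(len(domains)), key=probs.__getitem__)
    domain = domains[best]
    return domain, heads[domain](latent)


# Hypothetical stand-ins: a real model would use learned networks here.
DOMAINS = ["indoor", "outdoor"]


def toy_classifier(latent: Sequence[float]) -> List[float]:
    # Assumption for illustration: feature 0 scores "indoor", feature 1 "outdoor".
    return [latent[0], latent[1]]


toy_heads = {
    # Each toy head maps features to "depths" with a domain-specific cap.
    "indoor": lambda latent: [min(10.0, 2.0 * abs(x)) for x in latent],
    "outdoor": lambda latent: [min(80.0, 20.0 * abs(x)) for x in latent],
}

if __name__ == "__main__":
    print(route_to_head([2.0, 0.5], toy_classifier, toy_heads, DOMAINS))
    print(route_to_head([0.5, 3.0], toy_classifier, toy_heads, DOMAINS))
```

The point of the sketch is the design choice rather than the numbers: the domain is inferred from the image's latent representation, so no domain label has to be supplied at inference time.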


