Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics

by Arnav Varma, et al.

The advent of autonomous driving and advanced driver assistance systems necessitates continuous developments in computer vision for 3D scene understanding. Self-supervised monocular depth estimation, a method for pixel-wise distance estimation of objects from a single camera without the use of ground truth labels, is an important task in 3D scene understanding. However, existing methods for this task are limited to convolutional neural network (CNN) architectures. In contrast to CNNs, which use localized linear operations and lose feature resolution across layers, vision transformers process at constant resolution with a global receptive field at every stage. While recent works have compared transformers against their CNN counterparts for tasks such as image classification, no study has investigated the impact of using transformers for self-supervised monocular depth estimation. Here, we first demonstrate how to adapt vision transformers for self-supervised monocular depth estimation. Thereafter, we compare the transformer and CNN-based architectures for their performance on KITTI depth prediction benchmarks, as well as their robustness to natural corruptions and adversarial attacks, including when the camera intrinsics are unknown. Our study demonstrates that transformer-based architectures, though lower in run-time efficiency, achieve comparable performance while being more robust and generalizable.




