MVSFormer: Multi-View Stereo with Pre-trained Vision Transformers and Temperature-based Depth

by Chenjie Cao, et al.

Feature representation learning is the key recipe for learning-based Multi-View Stereo (MVS). As the common feature extractor in learning-based MVS, the vanilla Feature Pyramid Network (FPN) suffers from weak feature representations for reflective and texture-less areas, which limits the generalization of MVS. Even FPNs built on pre-trained Convolutional Neural Networks (CNNs) fail to tackle these issues. On the other hand, Vision Transformers (ViTs) have achieved prominent success in many 2D vision tasks. We thus ask whether ViTs can facilitate feature learning in MVS. In this paper, we propose a pre-trained-ViT-enhanced MVS network called MVSFormer, which learns more reliable feature representations by benefiting from informative priors of the ViT. We further propose two variants, MVSFormer-P and MVSFormer-H, with frozen and trainable ViT weights, respectively: MVSFormer-P is more efficient, while MVSFormer-H achieves superior performance. MVSFormer generalizes to various input resolutions through efficient multi-scale training strengthened by gradient accumulation. Moreover, we discuss the merits and drawbacks of classification- and regression-based MVS methods, and further propose to unify them with a temperature-based strategy. MVSFormer achieves state-of-the-art performance on the DTU dataset. In particular, our anonymous submission of MVSFormer ranked first on both the intermediate and advanced sets of the highly competitive Tanks-and-Temples leaderboard on the day of submission, compared with other published works. Code and models will be released soon.
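The temperature-based unification mentioned above can be illustrated with a minimal sketch. The intuition (my reading of the abstract, not the authors' exact implementation) is that classification-based MVS takes a winner-take-all depth over the hypothesis planes, while regression-based MVS takes a soft-argmax expectation; a temperature in the softmax interpolates between the two, since a small temperature sharpens the distribution toward one-hot. The function name and shapes below are illustrative assumptions:

```python
import numpy as np

def temperature_depth(cost_logits, depth_hypotheses, tau=1.0):
    """Turn per-pixel cost logits over D depth hypotheses into a depth map.

    A softmax with temperature `tau` interpolates between the two common
    depth readouts in learning-based MVS:
      * tau -> 0: the softmax sharpens toward one-hot, so the expectation
        approaches the argmax depth (classification-style);
      * tau = 1: a plain soft-argmax expectation (regression-style).
    """
    logits = cost_logits / tau                      # (H, W, D)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Expected depth under the temperature-scaled distribution.
    return (probs * depth_hypotheses).sum(axis=-1)  # (H, W)

# Toy example: a 1x1 "image" with 4 depth hypotheses.
logits = np.array([[[0.1, 2.0, 0.3, 0.2]]])
depths = np.array([1.0, 2.0, 3.0, 4.0])
soft = temperature_depth(logits, depths, tau=1.0)   # smooth expectation
hard = temperature_depth(logits, depths, tau=0.05)  # ~= argmax depth (2.0)
```

With `tau=0.05` the output collapses to the best hypothesis (2.0 here), while `tau=1.0` blends neighboring hypotheses, which is what gives regression-style readouts their sub-hypothesis precision.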


