Learning from Multi-View Representation for Point-Cloud Pre-Training

06/05/2023

A critical problem in pre-training on 3D point clouds is how to leverage massive 2D data, and the fundamental challenge is bridging the 2D-3D domain gap. This paper proposes a novel approach to point-cloud pre-training that learns 3D representations by leveraging pre-trained 2D networks. In particular, it avoids overfitting to 2D representations, which risks discarding 3D features that are critical for 3D recognition tasks. The key to our approach is a novel multi-view representation, which learns a shared 3D feature volume consistent with deep features extracted from multiple 2D camera views. The 2D deep features are regularized by pre-trained 2D networks through a 2D knowledge transfer loss. To prevent the resulting 3D feature representations from discarding 3D signals, we introduce a multi-view consistency loss that forces the projected 2D feature representations to capture pixel-wise correspondences across different views. Such correspondences induce 3D geometry and effectively retain 3D features in the projected 2D features. Experimental results demonstrate that our pre-trained model transfers successfully to various downstream tasks, including 3D detection and semantic segmentation, and achieves state-of-the-art performance.
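
The abstract names two losses but does not give their formulas. The following is a minimal sketch of how such losses could look, not the paper's definitions: the function names, the use of cosine distance for the 2D knowledge transfer term, and the InfoNCE-style contrastive form for the multi-view consistency term are all assumptions made for illustration. Features are plain lists of floats, one per pixel, and pixel `i` in one view is assumed to correspond to pixel `i` in the other.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def knowledge_transfer_loss(projected, teacher):
    """Assumed form: mean (1 - cosine) between projected 2D features
    and features from a frozen pre-trained 2D network, per pixel."""
    pairs = list(zip(projected, teacher))
    return sum(1.0 - cosine(p, t) for p, t in pairs) / len(pairs)

def multiview_consistency_loss(view_a, view_b, temperature=0.1):
    """Assumed form: InfoNCE over corresponding pixels.
    view_a[i] and view_b[i] are taken to observe the same 3D point;
    every other pixel of view_b acts as a negative."""
    total = 0.0
    for i, feat_a in enumerate(view_a):
        logits = [cosine(feat_a, feat_b) / temperature for feat_b in view_b]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)

# Toy features: two pixels, two channels each.
features_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0]]   # correspondences respected
swapped_b = [[0.0, 1.0], [1.0, 0.0]]   # correspondences violated

transfer = knowledge_transfer_loss(features_a, aligned_b)
aligned = multiview_consistency_loss(features_a, aligned_b)
swapped = multiview_consistency_loss(features_a, swapped_b)
```

In this toy setup the consistency term is small when corresponding pixels carry matching features and large when they do not, which is the behavior the abstract attributes to pixel-wise correspondences across views. A combined pre-training objective would presumably sum the two terms with some weighting, but the abstract does not say how.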

Related Research

Self-Supervised Learning with Multi-View Rendering for 3D Point Cloud Analysis

Multi-view Vision-Prompt Fusion Network: Can 2D Pre-trained Model Boost 3D Point Cloud Data-scarce Learning?

Pre-Training by Completing Point Clouds

MVSFormer: Multi-View Stereo with Pre-trained Vision Transformers and Temperature-based Depth

ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection

FAC: 3D Representation Learning via Foreground Aware Feature Contrast

Masked Feature Modelling: Feature Masking for the Unsupervised Pre-training of a Graph Attention Network Block for Bottom-up Video Event Recognition