Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

by Haowei Wang, et al.

In recent years, 3D representation learning has turned to 2D vision-language pre-trained models to overcome data scarcity challenges. However, existing methods simply transfer 2D alignment strategies, aligning 3D representations with single-view 2D images and coarse-grained parent category text. These approaches introduce information degradation and insufficient synergy issues, leading to performance loss. Information degradation arises from overlooking the fact that a 3D representation should be equivalent to a series of multi-view images and more fine-grained subcategory text. Insufficient synergy arises from neglecting the idea that a robust 3D representation should align with the joint vision-language space, rather than independently aligning with each modality. In this paper, we propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image. Specifically, a novel Structured Multimodal Organizer (SMO) is proposed to address the information degradation issue, which introduces contiguous multi-view images and hierarchical text to enrich the representation of vision and language modalities. A Joint Multi-modal Alignment (JMA) is designed to tackle the insufficient synergy problem, which models the joint modality by incorporating language knowledge into the visual modality. Extensive experiments on ModelNet40 and ScanObjectNN demonstrate the effectiveness of our proposed method, JM3D, which achieves state-of-the-art performance in zero-shot 3D classification. JM3D outperforms ULIP by approximately 4.3% and achieves an improvement of up to 6.5% for zero-shot 3D classification on ModelNet40. The source code and trained models for all our experiments are publicly available at
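To make the idea of aligning a 3D representation with a joint vision-language space more concrete, the following is a minimal sketch in plain Python. It is not the JM3D implementation: the abstract does not specify how SMO or JMA are computed, so the view averaging, the `alpha` blend used as a stand-in for "incorporating language knowledge into the visual modality", the symmetric InfoNCE loss, and all function names here are assumptions chosen only for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if its norm is zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return list(v) if n == 0 else [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_views(view_embs):
    """Average several per-view image embeddings into a single vector."""
    dim = len(view_embs[0])
    return [sum(v[i] for v in view_embs) / len(view_embs) for i in range(dim)]


def joint_feature(view_embs, text_emb, alpha=0.5):
    """Hypothetical joint vision-language feature.

    Assumption: a simple convex blend of the pooled multi-view image
    embedding and the text embedding. The paper's JMA may differ.
    """
    img = normalize(mean_views(view_embs))
    txt = normalize(text_emb)
    return normalize([alpha * i + (1 - alpha) * t for i, t in zip(img, txt)])


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(queries)


def symmetric_alignment_loss(point_embs, joint_embs, temperature=0.07):
    """Contrastive loss in both directions: 3D -> joint and joint -> 3D."""
    p = [normalize(x) for x in point_embs]
    j = [normalize(x) for x in joint_embs]
    return 0.5 * (info_nce(p, j, temperature) + info_nce(j, p, temperature))
```

In this sketch, each point cloud embedding is pulled toward the single joint feature of its own object and pushed away from the joint features of other objects in the batch, rather than being aligned with the image and text embeddings separately.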




