Large-Vocabulary 3D Diffusion Model with Transformer

by Ziang Cao et al.
Nanyang Technological University
The Chinese University of Hong Kong

Creating diverse, high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on a single category or only a few categories. In this paper, we introduce a diffusion-based feed-forward framework that synthesizes a massive number of real-world 3D object categories with a single generative model. Notably, large-vocabulary 3D generation poses three major challenges: a) the need for an expressive yet efficient 3D representation; b) large diversity in geometry and texture across categories; and c) the complexity of real-world object appearances. To this end, we propose DiffTF, a novel triplane-based 3D-aware Diffusion model with TransFormer, which addresses these challenges in three ways. 1) For efficiency and robustness, we adopt a revised triplane representation with improved fitting speed and accuracy. 2) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention, which learns the relations across the different planes and aggregates the generalized 3D knowledge with specialized 3D features. 3) In addition, we devise a 3D-aware encoder/decoder that enhances the generalized 3D knowledge in the encoded triplanes, handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) convincingly demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation with large diversity, rich semantics, and high quality.
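To make the triplane representation mentioned above concrete, here is a minimal sketch of how a 3D point is typically featurized from three axis-aligned feature planes (XY, XZ, YZ): the point is projected onto each plane, features are bilinearly sampled, and the three results are aggregated. The function names (`sample_plane`, `query_triplane`) and the summation-based aggregation are illustrative assumptions, not the paper's exact formulation, which may aggregate differently and decode the result with an MLP.

```python
import numpy as np

def sample_plane(plane, u, v):
    """Bilinearly sample a feature plane of shape (C, H, W) at
    normalized coordinates u, v in [0, 1]."""
    C, H, W = plane.shape
    x = u * (W - 1)
    y = v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Weighted blend of the four neighboring feature vectors.
    return ((1 - wx) * (1 - wy) * plane[:, y0, x0]
            + wx * (1 - wy) * plane[:, y0, x1]
            + (1 - wx) * wy * plane[:, y1, x0]
            + wx * wy * plane[:, y1, x1])

def query_triplane(planes, point):
    """Project a 3D point in [0, 1]^3 onto the XY, XZ, and YZ planes,
    sample each, and aggregate by summation (an assumed choice)."""
    xy, xz, yz = planes
    x, y, z = point
    return sample_plane(xy, x, y) + sample_plane(xz, x, z) + sample_plane(yz, y, z)
```

In a full pipeline, the aggregated feature vector would be decoded into density and color by a lightweight MLP, and the planes themselves would be what the diffusion model generates.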


