Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

08/17/2023
by Zehan Wang, et al.

3D scene understanding has gained significant attention due to its wide range of applications. However, existing methods for 3D scene understanding are limited to specific downstream tasks, which hinders their practicality in real-world applications. This paper presents Chat-3D, which combines the 3D visual perceptual ability of pre-trained 3D representations with the reasoning and conversation capabilities of advanced LLMs to build the first universal dialogue system for 3D scenes. Specifically, we align 3D representations into the feature space of LLMs, thus enabling LLMs to perceive the 3D world. Given the scarcity of 3D scene-text data, we propose a three-stage training strategy to efficiently utilize the available data for better alignment. To enhance the reasoning ability and develop a user-friendly interaction scheme, we further construct a high-quality object-centric 3D instruction dataset and design an associated object-centric prompt. Our experiments show that Chat-3D can comprehend diverse instructions for 3D scenes, engage in intricate spatial reasoning, and incorporate external knowledge into its responses. Chat-3D achieves a 75.6 relative score compared with GPT-4 on the constructed instruction dataset.
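The alignment step described above, mapping 3D representations into the feature space of an LLM, can be illustrated with a small sketch. The abstract does not specify the form of the mapping, so the single linear projection below is an assumption, as are all names (`make_projection`, `project`, `build_llm_input`) and the toy dimensions. The three-stage training strategy is not shown; the sketch only illustrates how projected per-object features might be placed alongside text embeddings.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised linear layer (hypothetical stand-in
    for a learned projection; the real mapping is not given in the abstract)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(features, weight, bias):
    """Apply y = W x + b to each 3D feature vector."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]

def build_llm_input(text_embeddings, object_features, weight, bias):
    """Prepend projected 3D tokens to the text token embeddings.
    Placing scene tokens first is an illustrative choice, not the paper's stated layout."""
    scene_tokens = project(object_features, weight, bias)
    return scene_tokens + text_embeddings

# Toy sizes, assumed for illustration only.
ENCODER_DIM, LLM_DIM = 4, 6
weight, bias = make_projection(ENCODER_DIM, LLM_DIM)

# Three hypothetical per-object feature vectors from a pre-trained 3D encoder.
object_features = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.0, 0.2],
    [0.9, 0.8, 0.7, 0.6],
]
# Two placeholder text token embeddings already in the LLM's space.
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

sequence = build_llm_input(text_embeddings, object_features, weight, bias)
print(len(sequence), len(sequence[0]))
```

In this sketch every projected 3D token has the same width as a text embedding, which is what lets the two kinds of token share one input sequence.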


