Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions

06/09/2023
by Ian Huang, et al.

What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? The translation from abstract scene descriptions to stylized scene elements cannot be done with any generality by extant systems trained on rigid and limited indoor datasets. In this paper, we propose to leverage the knowledge captured by foundation models to accomplish this translation. We present a system that can serve as a tool to generate stylized assets for 3D scenes described by a short phrase, without the need to enumerate the objects to be found within the scene or give instructions on their appearance. Additionally, it is robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist. Our system demonstrates this using a foundation model "team" composed of a large language model, a vision-language model, and several image diffusion models, which communicate using an interpretable and user-editable intermediate representation, thus allowing for more versatile and controllable stylized asset generation for 3D artists. We introduce novel metrics for this task, and show through human evaluations that in 91% of cases, our system outputs are judged more faithful to the semantics of the input scene description than the baseline, thus highlighting the potential of this approach to radically accelerate the 3D content creation process for 3D artists.
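To make the pipeline concrete, the sketch below illustrates the kind of interpretable, user-editable intermediate representation the abstract describes: a plain list of asset specifications that an "LLM stage" could produce from a scene phrase, and that an artist can edit before any assets are generated. All names (`AssetSpec`, `describe_scene`, `edit_spec`) and the hard-coded expansion are hypothetical stand-ins, not the paper's actual interface; a real system would call a language model rather than a lookup table.

```python
from dataclasses import dataclass


@dataclass
class AssetSpec:
    """One entry in the (hypothetical) intermediate representation:
    an object the scene should contain, plus a style description."""
    name: str
    style: str


def describe_scene(scene_phrase: str) -> list[AssetSpec]:
    """Stand-in for the LLM stage: expand an abstract scene phrase
    into stylized asset specifications. A real system would query a
    large language model; this stub uses a fixed lookup table."""
    lookup = {
        "a busy, dirty city street": [
            AssetSpec("taxi", "dented, mud-splattered yellow cab"),
            AssetSpec("trash can", "overflowing, rusted metal bin"),
        ],
    }
    return lookup.get(scene_phrase, [])


def edit_spec(specs: list[AssetSpec], name: str, new_style: str) -> list[AssetSpec]:
    """Because the representation is plain data, an artist can revise a
    single object's style before handing the list to the image stage."""
    return [AssetSpec(s.name, new_style) if s.name == name else s
            for s in specs]


if __name__ == "__main__":
    specs = describe_scene("a busy, dirty city street")
    specs = edit_spec(specs, "taxi", "vintage checkered cab, film-noir palette")
    for s in specs:
        print(f"{s.name}: {s.style}")
```

The key design point this illustrates is that the foundation models communicate through structured data the user can inspect and override, rather than through opaque embeddings.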


