Leveraging Large Language Models for Robot 3D Scene Understanding

09/12/2022
by   William Chen, et al.
2

Semantic 3D scene understanding is a problem of critical importance in robotics. While significant advances have been made in spatial perception, robots are still far from having the common-sense knowledge about household objects and locations of an average human. We thus investigate the use of large language models to impart common sense for scene understanding. Specifically, we introduce three paradigms for leveraging language for classifying rooms in indoor environments based on their contained objects: (i) a zero-shot approach, (ii) a feed-forward classifier approach, and (iii) a contrastive classifier approach. These methods operate on 3D scene graphs produced by modern spatial perception systems. We then analyze each approach, demonstrating notable zero-shot generalization and transfer capabilities stemming from their use of language. Finally, we show these approaches also apply to inferring building labels from contained rooms and demonstrate our zero-shot approach on a real environment. All code can be found at https://github.com/MIT-SPARK/llm_scene_understanding.

READ FULL TEXT
research
06/09/2022

Extracting Zero-shot Common Sense from Large Language Models for Robot 3D Scene Understanding

Semantic 3D scene understanding is a problem of critical importance in r...
research
07/23/2022

Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

We study open-world 3D scene understanding, a family of tasks that requi...
research
05/23/2023

Can Language Models Understand Physical Concepts?

Language models (LMs) gradually become general-purpose interfaces in the...
research
10/05/2022

DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics

We introduce the first work to explore web-scale diffusion models for ro...
research
12/15/2021

Is "my favorite new movie" my favorite movie? Probing the Understanding of Recursive Noun Phrases

Recursive noun phrases (NPs) have interesting semantic properties. For e...
research
04/11/2023

L3MVN: Leveraging Large Language Models for Visual Target Navigation

Visual target navigation in unknown environments is a crucial problem in...
research
08/22/2023

Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts

Cross-scene generalizable NeRF models, which can directly synthesize nov...

Please sign up or login with your details

Forgot password? Click here to reset