PaLM-E: An Embodied Multimodal Language Model

03/06/2023
by Danny Driess, et al.

Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
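To make the "multi-modal sentence" idea in the abstract concrete, the sketch below shows one way continuous observation features can be projected into a language model's token-embedding space and interleaved with ordinary text-token embeddings. This is an illustrative assumption, not PaLM-E's actual implementation; all names, dimensions, and the random stand-in weights are hypothetical.

```python
# Minimal sketch (assumed, not the PaLM-E code): building a "multi-modal sentence"
# by interleaving text-token embeddings with projected continuous features.
import numpy as np

D_MODEL = 512          # assumed LLM embedding width
VOCAB = 32000          # assumed text vocabulary size
rng = np.random.default_rng(0)

token_embedding = rng.normal(size=(VOCAB, D_MODEL))    # stand-in for the LLM's embedding table
image_projection = rng.normal(size=(256, D_MODEL))     # stand-in for an encoder + learned projection

def embed_text(token_ids):
    """Look up ordinary text-token embeddings."""
    return token_embedding[np.asarray(token_ids)]

def embed_image(image_features):
    """Project continuous observation features (e.g. visual encoder outputs)
    into the same space as text tokens, one 'token' per feature vector."""
    return np.asarray(image_features) @ image_projection

def multimodal_sentence(segments):
    """Interleave text and continuous segments in the order given,
    yielding a single embedding sequence for the language model."""
    parts = []
    for kind, payload in segments:
        parts.append(embed_text(payload) if kind == "text" else embed_image(payload))
    return np.concatenate(parts, axis=0)

# Example: "Q: What happened between <img1> and <img2>?"
img1 = rng.normal(size=(4, 256))   # 4 feature vectors per image (assumed)
img2 = rng.normal(size=(4, 256))
sequence = multimodal_sentence([
    ("text", [101, 202, 303]),     # placeholder token ids
    ("image", img1),
    ("text", [404]),
    ("image", img2),
])
print(sequence.shape)              # (3 + 4 + 1 + 4, 512), fed to the language model
```

In this reading of the abstract, the projection that maps sensor features into the embedding space is what gets trained end-to-end together with the pre-trained language model, so words and percepts share one sequence representation.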



Related research

08/24/2023
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
We introduce the Qwen-VL series, a set of large-scale vision-language mo...

07/28/2023
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
We study how vision-language models trained on Internet-scale data can b...

08/31/2023
Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception
Vision-language models (VLMs) have shown powerful capabilities in visual...

07/01/2023
DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment
Large language models encode a vast amount of semantic knowledge and pos...

12/12/2022
Prompting Is Programming: A Query Language For Large Language Models
Large language models have demonstrated outstanding performance on a wid...

07/24/2023
3D-LLM: Injecting the 3D World into Large Language Models
Large language models (LLMs) and Vision-Language Models (VLMs) have been...

06/30/2023
Look, Remember and Reason: Visual Reasoning with Grounded Rationales
Large language models have recently shown human level performance on a v...
