A Generalist Agent

05/12/2022
by Scott Reed et al.

Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
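To make the serialization idea concrete, below is a minimal Python sketch (not the authors' code) of how Gato-style tokenization can work: continuous observations and actions are mu-law companded and discretized into 1024 uniform bins placed after a 32,000-token SentencePiece text vocabulary, so text and control data share one flat token stream for a single sequence model. The report handles image patches separately (as embeddings rather than discrete tokens), which this sketch omits; the helper names, SEPARATOR id, and episode layout are illustrative assumptions.

```python
import math

# Token-space layout loosely following the Gato report: text tokens occupy
# ids [0, 32000); discretized continuous values are shifted into the next
# 1024 ids. SEPARATOR is a hypothetical id marking the obs/action boundary.
TEXT_VOCAB = 32_000
NUM_BINS = 1_024
SEPARATOR = TEXT_VOCAB + NUM_BINS

def mu_law(x, mu=100.0, m=256.0):
    """Mu-law companding (mu=100, M=256 per the report), squashing a
    continuous value toward [-1, 1] before uniform binning."""
    return math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)

def tokenize_continuous(x):
    """Map one continuous observation/action value to a token id in
    [TEXT_VOCAB, TEXT_VOCAB + NUM_BINS)."""
    squashed = max(-1.0, min(1.0, mu_law(x)))
    return TEXT_VOCAB + int((squashed + 1.0) / 2.0 * (NUM_BINS - 1))

def build_episode_sequence(observations, actions):
    """Flatten (observation, action) pairs from one episode into a single
    token stream; one transformer with one set of weights is then trained
    on such streams (text, images, control) with next-token prediction."""
    tokens = []
    for obs, act in zip(observations, actions):
        tokens.extend(tokenize_continuous(v) for v in obs)  # e.g. joint angles
        tokens.append(SEPARATOR)
        tokens.extend(tokenize_continuous(v) for v in act)  # e.g. joint torques
    return tokens

# Tiny usage example: two timesteps of a 3-DoF arm.
seq = build_episode_sequence(
    observations=[[0.1, -0.4, 0.9], [0.2, -0.3, 0.8]],
    actions=[[0.05, 0.0, -0.1], [0.02, 0.01, -0.05]],
)
print(seq)
```

Because every modality is reduced to (or embedded alongside) tokens in one sequence, the same network can be prompted with a game screen, a chat history, or a robot state, and the surrounding context alone determines whether the next predicted tokens decode to text, button presses, or joint torques.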


Related research:

- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration (06/15/2023)
  Although instruction-tuned large language models (LLMs) have exhibited r...

- Masked Vision and Language Modeling for Multi-modal Representation Learning (08/03/2022)
  In this paper, we study how to use masked signal modeling in vision and ...

- CM3: A Causal Masked Multimodal Model of the Internet (01/19/2022)
  We introduce CM3, a family of causally masked generative models trained ...

- Multi-modal Robustness Analysis Against Language and Visual Perturbations (07/05/2022)
  Joint visual and language modeling on large-scale datasets has recently ...

- From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data (10/18/2022)
  While large-scale sequence modeling from offline data has led to impress...

- Enabling Language Models to Fill in the Blanks (05/11/2020)
  We present a simple approach for text infilling, the task of predicting ...
