ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks

12/03/2019 · by Mohit Shridhar, et al.

We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. Long composition rollouts with non-reversible state changes are among the phenomena we include to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model designed for recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
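To make the task format concrete, here is a minimal sketch in Python of what one ALFRED-style training example might look like: a high-level goal paired with low-level step instructions and an expert action sequence. The field names (`goal`, `instructions`, `actions`) and action names are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical ALFRED-style directive paired with an expert demonstration.
# Field names and action vocabulary are illustrative, not the real schema.
example = {
    "goal": "Rinse off a mug and place it in the coffee maker.",
    "instructions": [
        "Walk to the coffee maker on the right.",
        # ... further low-level step instructions ...
    ],
    "actions": [
        {"action": "MoveAhead"},
        {"action": "PickupObject", "object": "Mug"},
        # ... later steps may cause non-reversible state changes ...
    ],
}

def action_sequence(ex):
    """Return the expert action names for one directive."""
    return [step["action"] for step in ex["actions"]]

print(action_sequence(example))  # ['MoveAhead', 'PickupObject']
```

A model for this benchmark would map the language fields plus egocentric frames to such an action sequence, one step at a time.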


Related research

11/20/2017 · Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
"A robot that can carry out a natural-language instruction has been a dre..."

12/06/2021 · CALVIN: A Benchmark for Language-conditioned Policy Learning for Long-horizon Robot Manipulation Tasks
"General-purpose robots coexisting with humans in their environment must ..."

01/19/2021 · A Modular Vision Language Navigation and Manipulation Framework for Long Horizon Compositional Tasks in Indoor Environment
"In this paper we propose a new framework - MoViLan (Modular Vision and L..."

04/09/2023 · ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes
"Understanding the continuous states of objects is essential for task lea..."

10/16/2019 · Conditional Driving from Natural Language Instructions
"Widespread adoption of self-driving cars will depend not only on their s..."

10/20/2021 · SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark
"Existing work in language grounding typically study single environments..."

09/29/2020 · Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions
"The recently proposed ALFRED challenge task aims for a virtual robotic a..."
