Explore to Generalize in Zero-Shot RL

by Ev Zisselman, et al.

We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks so that it performs well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance to the task. However, on problems such as ProcGen Maze, an adequate solution that is invariant to the task's visual appearance does not exist, and invariance-based approaches therefore fail. Our insight is that a policy that explores the domain effectively is harder to memorize than a policy that maximizes reward for a specific task, so we expect such learned behavior to generalize well; we demonstrate this empirically on several domains that are difficult for invariance-based approaches. Our Explore to Generalize algorithm (ExpGen) builds on this insight: we additionally train an ensemble of agents that optimize reward. At test time, either the ensemble agrees on an action and we generalize well, or we take exploratory actions, which are guaranteed to generalize and drive the agent to a novel part of the state space, where the ensemble may agree again. We show that our approach achieves state-of-the-art performance on several tasks in the ProcGen challenge that have so far eluded effective generalization; for example, it reaches a success rate of 82% on Maze and 74% on Heist with 200 training levels.
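The test-time decision rule described above can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation: the policy interfaces, the helper name `expgen_act`, and the agreement threshold `agree_frac` are all assumptions made for the sketch.

```python
import numpy as np

def expgen_act(obs, reward_agents, explore_agent, agree_frac=1.0):
    """One test-time action step, sketching the ExpGen decision rule.

    reward_agents: list of policies trained to maximize task reward,
                   each mapping an observation to a discrete action.
    explore_agent: a policy trained for task-agnostic exploration.
    agree_frac:    fraction of the ensemble that must agree (assumed knob).
    """
    actions = [agent(obs) for agent in reward_agents]
    values, counts = np.unique(actions, return_counts=True)
    # If enough ensemble members agree, trust the reward-seeking behavior...
    if counts.max() >= agree_frac * len(reward_agents):
        return int(values[counts.argmax()])
    # ...otherwise fall back to the exploration policy, which generalizes
    # by construction and may drive the agent to a novel state where the
    # ensemble agrees again.
    return int(explore_agent(obs))
```

The key design point is that disagreement among the reward-trained ensemble is used as a signal of being out of distribution, triggering the exploratory fallback instead of an arbitrary memorized action.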


