Human-level Atari 200x faster

by Steven Kapturowski, et al.

The task of building general agents that perform well over a wide range of tasks has been an important goal in reinforcement learning since its inception. The problem has attracted a large body of research, with performance frequently measured by observing scores over the wide range of environments contained in the Atari 57 benchmark. Agent57 was the first agent to surpass the human benchmark on all 57 games, but this came at the cost of poor data-efficiency, requiring nearly 80 billion frames of experience. Taking Agent57 as a starting point, we employ a diverse set of strategies to achieve a 200-fold reduction in the experience needed to outperform the human baseline. We investigate a range of instabilities and bottlenecks we encountered while reducing the data regime, and propose effective solutions to build a more robust and efficient agent. We also demonstrate competitive performance with high-performing methods such as Muesli and MuZero. The four key components of our approach are (1) an approximate trust-region method which enables stable bootstrapping from the online network, (2) a normalisation scheme for the loss and priorities which improves robustness when learning a set of value functions with a wide range of scales, (3) an improved architecture employing techniques from NFNets to leverage deeper networks without the need for normalisation layers, and (4) a policy distillation method which smooths out the instantaneous greedy policy over time.
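To give a concrete sense of component (2), the sketch below normalises TD errors by a running estimate of their scale before they are used as losses and replay priorities. This is a minimal illustration only: the class name, the exponential-moving-average scheme, and all hyperparameters are assumptions, not the paper's exact formulation.

```python
import numpy as np

class TDNormalizer:
    """Running-scale normaliser for one value function's TD errors.

    A hedged sketch of loss/priority normalisation: dividing by a running
    root-mean-square estimate puts value functions with very different
    reward scales on a comparable footing. The EMA scheme and constants
    here are illustrative assumptions.
    """

    def __init__(self, decay=0.99, eps=1e-8):
        self.decay = decay
        self.eps = eps
        self.second_moment = 0.0  # EMA of squared TD errors

    def update(self, td_errors):
        # Track the scale of this value function's TD errors, then
        # return the normalised errors for use in loss and priorities.
        batch_ms = float(np.mean(np.square(td_errors)))
        self.second_moment = (self.decay * self.second_moment
                              + (1.0 - self.decay) * batch_ms)
        return self.normalize(td_errors)

    def normalize(self, td_errors):
        scale = np.sqrt(self.second_moment) + self.eps
        return td_errors / scale
```

One such normaliser would be kept per value function, so that a head with rewards in the thousands contributes losses and priorities of roughly the same magnitude as a head with rewards near one.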




