Deep Reinforcement Learning at the Edge of the Statistical Precipice

08/30/2021
by   Rishabh Agarwal, et al.
0

Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field.

READ FULL TEXT
research
02/19/2019

Investigating Generalisation in Continuous Deep Reinforcement Learning

Deep Reinforcement Learning has shown great success in a variety of cont...
research
12/10/2019

Measuring the Reliability of Reinforcement Learning Algorithms

Lack of reliability is a well-known issue for reinforcement learning (RL...
research
12/08/2021

A Review for Deep Reinforcement Learning in Atari:Benchmarks, Challenges, and Solutions

The Arcade Learning Environment (ALE) is proposed as an evaluation platf...
research
09/19/2017

Deep Reinforcement Learning that Matters

In recent years, significant progress has been made in solving challengi...
research
06/19/2023

AdaStop: sequential testing for efficient and reliable comparisons of Deep RL Agents

The reproducibility of many experimental results in Deep Reinforcement L...
research
04/16/2020

Analyzing Reinforcement Learning Benchmarks with Random Weight Guessing

We propose a novel method for analyzing and visualizing the complexity o...

Please sign up or login with your details

Forgot password? Click here to reset