Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions

12/23/2021
by Brett Daley, et al.

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action. Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating ("cutting") the ratios ("traces") to counteract the excessive variance of the IS estimator. Unfortunately, cutting traces on a per-decision basis is not necessarily efficient; once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and slower learning. In the interest of motivating efficient off-policy algorithms, we propose a multistep operator that permits arbitrary past-dependent traces. We prove that our operator is convergent for policy evaluation, and for optimal control when targeting greedy-in-the-limit policies. Our theorems establish the first convergence guarantees for many existing algorithms including Truncated IS, Non-Markov Retrace, and history-dependent TD(λ). Our theoretical results also provide guidance for the development of new algorithms that jointly consider multiple past decisions for better credit assignment and faster learning.
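For concreteness, below is a minimal, hypothetical sketch of the classical per-decision trace-cutting mechanism the abstract describes, using the Retrace(λ) coefficient c_t = λ min(1, π(a_t|s_t)/μ(a_t|s_t)) in a tabular backward-view update. It illustrates the standard baseline that the paper argues against, not the past-dependent operator the authors propose; all function and variable names here are illustrative assumptions.

```python
# Hypothetical sketch of per-decision trace cutting (Retrace-style),
# not the past-dependent operator proposed in the paper.
# Assumes a tabular setting with dict-of-dict policies: pi[s][a] -> probability.

def retrace_update(Q, trajectory, pi, mu, gamma=0.99, lam=1.0, alpha=0.1):
    """One backward-view Retrace(lambda) pass over a trajectory.

    Q          : dict mapping (state, action) -> value estimate
    trajectory : list of (state, action, reward, next_state) tuples
    pi, mu     : target and behavior policies, pi[s][a] gives a probability
    """
    e = {}  # eligibility traces, keyed by (state, action)
    for (s, a, r, s_next) in trajectory:
        # Expected value of the next state under the target policy.
        v_next = sum(pi[s_next][b] * Q.get((s_next, b), 0.0) for b in pi[s_next])
        td_error = r + gamma * v_next - Q.get((s, a), 0.0)

        # Per-decision trace coefficient: the truncated IS ratio.
        # Once min(1, rho) cuts the trace, later steps cannot restore it,
        # which is the inefficiency the paper's past-dependent traces address.
        rho = pi[s][a] / mu[s][a]
        c = lam * min(1.0, rho)

        # Decaying existing traces by gamma * c before crediting the current
        # TD error reproduces the forward-view product of coefficients c_1..c_t.
        for key in e:
            e[key] *= gamma * c
        e[(s, a)] = e.get((s, a), 0.0) + 1.0  # accumulating trace

        # Credit the TD error to every (state, action) still carrying a trace.
        for key, trace in e.items():
            Q[key] = Q.get(key, 0.0) + alpha * trace * td_error
    return Q
```

In this baseline, a single low ratio ρ_t collapses the traces of all earlier state-action pairs for the rest of the trajectory, whereas the operator studied in the paper allows the trace to depend on the whole history of past decisions rather than on each decision in isolation.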


