Q(λ) with Off-Policy Corrections

02/16/2016
by Anna Harutyunyan, et al.

We propose and analyze an alternative approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions hold. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor, and formalize an underlying tradeoff in off-policy TD(λ). We illustrate this theoretical relationship empirically on a continuous-state control task.
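To make the abstract's idea concrete, here is a minimal sketch (not taken from the paper) of a reward-based off-policy correction: instead of re-weighting the return with likelihood ratios of the target policy, each step contributes a TD error built from the current Q-function's expectation under the target policy, accumulated with a (γλ) decay. The function name `corrected_lambda_return`, the tabular array layout, and the default parameter values are illustrative assumptions.

```python
import numpy as np

def corrected_lambda_return(rewards, states, actions, q, target_pi,
                            gamma=0.99, lam=0.8):
    """Sketch of a Q-function-based off-policy corrected lambda-return.

    Each step's correction is a TD error formed with the current Q-function,
        delta_k = r_k + gamma * E_{a ~ pi}[Q(s_{k+1}, a)] - Q(s_k, a_k),
    accumulated with a (gamma * lam) decay, rather than an importance-weighted
    return involving the target policy's probabilities of the behavior actions.

    rewards:   [r_0, ..., r_{T-1}] collected under the behavior policy
    states:    [s_0, ..., s_T] visited states (assumed non-terminal, or q == 0
               at terminal states)
    actions:   [a_0, ..., a_{T-1}] behavior actions
    q:         2-D array, q[s, a] = current action-value estimate
    target_pi: 2-D array, target_pi[s, a] = target-policy probability
    Returns the corrected lambda-return for the pair (s_0, a_0).
    """
    T = len(rewards)
    g = q[states[0], actions[0]]
    decay = 1.0
    for k in range(T):
        # Expected action-value of the next state under the *target* policy.
        expected_q_next = np.dot(target_pi[states[k + 1]], q[states[k + 1]])
        delta = rewards[k] + gamma * expected_q_next - q[states[k], actions[k]]
        g += decay * delta
        decay *= gamma * lam
    return g
```

As the abstract indicates, corrections of this approximate kind can only be relied on when λ, γ, and the gap between the target and behavior policies are jointly small enough; formalizing a condition of that form is the paper's contribution.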


Related research

02/23/2017 — Consistent On-Line Off-Policy Evaluation
The problem of on-line off-policy evaluation (OPE) has been actively stu...

05/17/2019 — TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning
Off-policy reinforcement learning with eligibility traces is challenging...

07/05/2018 — Per-decision Multi-step Temporal Difference Learning with Control Variates
Multi-step temporal difference (TD) learning is an important approach in...

03/01/2019 — Should All Temporal Difference Learning Use Emphasis?
Emphatic Temporal Difference (ETD) learning has recently been proposed a...

07/16/2021 — Refined Policy Improvement Bounds for MDPs
The policy improvement bound on the difference of the discounted returns...

01/21/2021 — Breaking the Deadly Triad with a Target Network
The deadly triad refers to the instability of a reinforcement learning a...

12/23/2021 — Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions
Off-policy learning from multistep returns is crucial for sample-efficie...