Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

09/09/2019
by Kristopher De Asis, et al.

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a fixed number of future time steps. To learn the value function for horizon h, these algorithms bootstrap from the value function for horizon h-1, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as "the deadly triad"). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and n-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
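To make the bootstrapping structure concrete: in the one-step case, the horizon-h value estimate is updated toward a target built from the horizon-(h-1) estimate at the next state, V_h(S_t) ← V_h(S_t) + α[R_{t+1} + γ V_{h-1}(S_{t+1}) − V_h(S_t)], with V_0 ≡ 0, so information only flows downward in horizon and no estimate feeds back into itself. The sketch below is a minimal tabular illustration of that update, not code from the paper; the environment (a toy random-walk chain) and all names (fixed_horizon_td_update, n_states, H, alpha) are illustrative assumptions.

```python
import numpy as np

def fixed_horizon_td_update(V, s, r, s_next, done, alpha=0.1, gamma=1.0):
    """Apply one transition (s, r, s_next) to every horizon's value table.

    V has shape (H + 1, n_states); row 0 is the zero function V_0 and is
    never written, so each horizon bootstraps only from the one below it.
    """
    H = V.shape[0] - 1
    for h in range(1, H + 1):
        # Horizon h bootstraps from horizon h - 1 (zero if the episode ended).
        target = r + (0.0 if done else gamma * V[h - 1, s_next])
        V[h, s] += alpha * (target - V[h, s])

# Usage on a hypothetical 5-state random-walk chain with terminal ends:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, H = 5, 10
    V = np.zeros((H + 1, n_states))  # V[0] stays identically zero
    for _ in range(2000):
        s, done = 2, False            # start in the middle of the chain
        while not done:
            s_next = s + int(rng.choice([-1, 1]))   # uniform random policy
            r = 1.0 if s_next == n_states - 1 else 0.0
            done = s_next in (0, n_states - 1)
            fixed_horizon_td_update(V, s, r, s_next, done)
            s = s_next
    print(V[H])  # estimated 10-step returns from each state
```

Because every table V_h only reads from V_{h-1}, the updates for all horizons can be applied in parallel from the same transition, which is the source of the complexity reductions the abstract mentions.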

Related research

02/05/2019 · Separating value functions across time-scales
In many finite horizon episodic reinforcement learning (RL) settings, it...

01/05/2022 · A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions
Estimating value functions is a core component of reinforcement learning...

04/15/2021 · Predictor-Corrector(PC) Temporal Difference(TD) Learning (PCTD)
Using insight from numerical approximation of ODEs and the problem formu...

12/28/2018 · Differential Temporal Difference Learning
Value functions derived from Markov decision processes arise as a centra...

05/28/2019 · Conditions on Features for Temporal Difference-Like Methods to Converge
The convergence of many reinforcement learning (RL) algorithms with line...

01/19/2023 · Suboptimality analysis of receding horizon quadratic control with unknown linear systems and its applications in learning-based control
For a receding-horizon controller with a known system and with an approx...

07/18/2013 · Efficient Reinforcement Learning in Deterministic Systems with Value Function Generalization
We consider the problem of reinforcement learning over episodes of a fin...
