Conditions on Features for Temporal Difference-Like Methods to Converge

05/28/2019
by Marcus Hutter, et al.

The convergence of many reinforcement learning (RL) algorithms with linear function approximation has been investigated extensively, but most proofs assume that these methods converge to a unique solution. In this paper, we provide a complete characterization of non-uniqueness issues for a large class of reinforcement learning algorithms, simultaneously unifying many counter-examples to convergence within a single theoretical framework. We achieve this by proving a new condition on the features that determines whether the convergence assumptions are valid or non-uniqueness holds. We consider a general class of RL methods, which we call natural algorithms, whose solutions are characterized as the fixed point of a projected Bellman equation (when it exists); notably, bootstrapped temporal-difference methods such as TD(λ) and GTD(λ) are natural algorithms. Our main result proves that natural algorithms converge to the correct solution if and only if all value functions in the approximation space have a certain shape. This implies that natural algorithms are, in general, inherently prone to converging to the wrong solution for most feature choices, even when the value function can be represented exactly. Given these results, we show that state-aggregation-based features are a safe choice for natural algorithms, and we also provide a condition for finding convergent algorithms under other feature constructions.
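
As a rough illustration of the kind of method the abstract calls a natural algorithm, the sketch below runs linear TD(λ) on a small Markov reward process using state-aggregation features, i.e. indicator features that assign each state to exactly one group, which the abstract identifies as a safe choice. The MRP, step size, trace parameter, and aggregation are hypothetical choices made up for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-state Markov reward process (not from the paper).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])   # transition probabilities
r = np.array([1.0, 0.0, 2.0])     # expected reward per state
gamma = 0.9                       # discount factor

# State-aggregation features: states 0 and 1 share a group, state 2 is
# its own group, so every row of Phi is an indicator vector.
Phi = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])

def td_lambda(P, r, Phi, gamma, lam=0.8, alpha=0.02, n_steps=200_000):
    """Online linear TD(lambda) on one long trajectory of the MRP."""
    n_states, n_features = Phi.shape
    w = np.zeros(n_features)      # weight vector
    z = np.zeros(n_features)      # eligibility trace
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(n_states, p=P[s])
        delta = r[s] + gamma * Phi[s_next] @ w - Phi[s] @ w  # TD error
        z = gamma * lam * z + Phi[s]                         # accumulate trace
        w += alpha * delta * z
        s = s_next
    return w

w = td_lambda(P, r, Phi, gamma)
v_true = np.linalg.solve(np.eye(3) - gamma * P, r)  # exact values, for comparison
print("approximate values:", Phi @ w)
print("true values:       ", v_true)
```

With indicator features like these, the projection in the projected Bellman equation is well behaved and the iterates settle near the best aggregated approximation of the true values; with arbitrary feature matrices, the paper's results indicate that this need not happen.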
