Preferential Temporal Difference Learning

by Nishanth Anand et al.

Temporal-Difference (TD) learning is a general and widely used tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, that state's value can be used to compute the TD-error, which is then propagated to other states. However, when computing updates, it may be useful to take into account information beyond whether a state was visited. For example, some states might be more important than others (such as states that occur frequently in successful trajectories). Others might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach that re-weights the states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
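To make the idea concrete, here is a minimal tabular sketch of a preference-weighted TD(0) update. The per-state weight `beta` and the exact way it enters the update are illustrative assumptions for this sketch, not the paper's precise Preferential TD algorithm: `beta[s]` scales how strongly state `s` is updated as an input, and `beta[s_next]` down-weights how much the successor's value is trusted as a bootstrap target.

```python
import numpy as np

def preferential_td_update(V, beta, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One preference-weighted TD(0) update for the transition (s, r, s_next).

    beta[s] in [0, 1] is a hypothetical per-state preference: states with
    unreliable value estimates get low beta, so they contribute less as
    bootstrap targets and receive smaller updates. This is a simplified
    illustration, not the paper's exact formulation.
    """
    # Down-weight the bootstrapped part of the target by beta[s_next].
    target = r if done else r + gamma * beta[s_next] * V[s_next]
    td_error = target - V[s]
    # Scale the update to V[s] by its own preference beta[s].
    V[s] += alpha * beta[s] * td_error
    return td_error

# Usage: a 3-state chain where state 1 has an unreliable value estimate,
# so it is given a low preference and is barely trusted as a target.
V = np.zeros(3)
beta = np.array([1.0, 0.2, 1.0])
preferential_td_update(V, beta, s=0, r=1.0, s_next=1, done=False)
```

With `beta[1] = 0.2`, the bootstrap term `gamma * V[1]` contributes only a fifth of what it would under standard TD(0), so errors in state 1's estimate propagate less to its predecessors.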


Related papers:

- Reducing Sampling Error in Batch Temporal Difference Learning
- On the Statistical Benefits of Temporal Difference Learning
- PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method
- Directly Estimating the Variance of the λ-Return Using Temporal-Difference Methods
- Optimistic Temporal Difference Learning for 2048
- META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning
- Approximate discounting-free policy evaluation from transient and recurrent states
