Reducing Sampling Error in Batch Temporal Difference Learning

by   Brahma Pavse, et al.

Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning. This paper studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given policy from a batch of data. In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted according to the number of times that action occurred in the batch – not the true probability of the action under the given policy. To address this limitation, we introduce policy sampling error corrected-TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates the empirical distribution of actions in each state in the batch and then uses importance sampling to correct for the mismatch between the empirical weighting and the correct weighting for updates following each action. We refine the concept of a certainty-equivalence estimate and argue that PSEC-TD(0) is a more data efficient estimator than TD(0) for a fixed batch of data. Finally, we conduct an empirical evaluation of PSEC-TD(0) on three batch value function learning tasks, with a hyperparameter sensitivity analysis, and show that PSEC-TD(0) produces value function estimates with lower mean squared error than TD(0).


page 7

page 28


O^2TD: (Near)-Optimal Off-Policy TD Learning

Temporal difference learning and Residual Gradient methods are the most ...

Preferential Temporal Difference Learning

Temporal-Difference (TD) learning is a general and very useful tool for ...

Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction

Tree Search (TS) is crucial to some of the most influential successes in...

Batch-iFDD for Representation Expansion in Large MDPs

Matching pursuit (MP) methods are a promising class of feature construct...

Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error

In this work, we study the use of the Bellman equation as a surrogate ob...

Q^ Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison

We prove performance guarantees of two algorithms for approximating Q^ i...

On the Statistical Benefits of Temporal Difference Learning

Given a dataset on actions and resulting long-term rewards, a direct est...

Please sign up or login with your details

Forgot password? Click here to reset