Robust and Adaptive Temporal-Difference Learning Using An Ensemble of Gaussian Processes

by Qin Lu, et al.

Value function approximation is a crucial module for policy evaluation in reinforcement learning when the state space is large or continuous. The present paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning, where a Gaussian process (GP) prior is placed on the sought value function, and instantaneous rewards are probabilistically generated based on value function evaluations at two consecutive states. Capitalizing on a random feature-based approximant of the GP prior, an online scalable (OS) approach, termed OS-GPTD, is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs. To benchmark the performance of OS-GPTD even in an adversarial setting, where the modeling assumptions are violated, complementary worst-case analyses are performed by upper-bounding the cumulative Bellman error as well as the long-term reward prediction error, relative to their counterparts from a fixed value function estimator with access to the entire state-reward trajectory in hindsight. Moreover, to alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme, termed OS-EGPTD, that jointly infers the value function and interactively selects the EGP kernel on the fly. Finally, the performance of the novel OS-(E)GPTD schemes is evaluated on two benchmark problems.
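
As a rough illustration of the generative view described in the abstract, the sketch below approximates a GP prior with random Fourier features and treats each reward as a noisy observation of V(s) - gamma * V(s'), updated recursively one transition at a time. It is a minimal sketch under stated assumptions, not the paper's OS-GPTD algorithm: the RBF kernel, the Gaussian noise model, the recursive Bayesian linear-regression update, and all names (`make_random_features`, `RandomFeatureGPTD`, `lengthscale`, `noise_var`, `prior_var`) are illustrative choices that do not come from the source.

```python
import math
import random


def make_random_features(state_dim, num_pairs, lengthscale, rng):
    """Return phi(state): a sin/cos random Fourier feature map (assumed RBF kernel)."""
    # Frequencies drawn from the spectral density of an RBF kernel.
    freqs = [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(state_dim)]
             for _ in range(num_pairs)]
    scale = 1.0 / math.sqrt(num_pairs)

    def phi(state):
        proj = [sum(w * s for w, s in zip(row, state)) for row in freqs]
        return ([scale * math.sin(p) for p in proj]
                + [scale * math.cos(p) for p in proj])

    return phi


class RandomFeatureGPTD:
    """Hypothetical online estimator: V(s) ~ phi(s)^T theta, theta ~ N(0, prior_var * I)."""

    def __init__(self, phi, dim, gamma, noise_var, prior_var=1.0):
        self.phi = phi
        self.dim = dim
        self.gamma = gamma
        self.noise_var = noise_var
        self.mean = [0.0] * dim
        self.cov = [[prior_var if i == j else 0.0 for j in range(dim)]
                    for i in range(dim)]

    def value(self, state):
        """Posterior mean of the value function at `state`."""
        return sum(m * f for m, f in zip(self.mean, self.phi(state)))

    def update(self, state, reward, next_state):
        """Assumed model: reward = (phi(s) - gamma * phi(s'))^T theta + Gaussian noise."""
        f, f_next = self.phi(state), self.phi(next_state)
        h = [a - self.gamma * b for a, b in zip(f, f_next)]
        cov_h = [sum(c * x for c, x in zip(row, h)) for row in self.cov]
        innov_var = sum(x * y for x, y in zip(h, cov_h)) + self.noise_var
        residual = reward - sum(m * x for m, x in zip(self.mean, h))
        gain = [c / innov_var for c in cov_h]
        # Rank-one posterior update (same form as a Kalman or recursive least-squares step).
        self.mean = [m + g * residual for m, g in zip(self.mean, gain)]
        self.cov = [[self.cov[i][j] - gain[i] * cov_h[j] for j in range(self.dim)]
                    for i in range(self.dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    phi = make_random_features(state_dim=1, num_pairs=20, lengthscale=1.0, rng=rng)
    model = RandomFeatureGPTD(phi, dim=40, gamma=0.5, noise_var=0.01)
    # Toy self-loop with constant reward 1 and gamma = 0.5, so the true value is 2.
    for _ in range(200):
        model.update([0.3], 1.0, [0.3])
    print(round(model.value([0.3]), 3))
```

Each update costs on the order of the squared feature dimension and does not grow with the number of transitions seen, which is the sense in which a random-feature approximation can make an online scheme scalable. An ensemble variant would, under the same assumptions, run several such models with different kernels and reweight them by how well each predicts incoming rewards.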



The Local Optimality of Reinforcement Learning by Value Gradients, and its Relationship to Policy Gradient Learning

In this theoretical paper we are concerned with the problem of learning ...

Classification-based Approximate Policy Iteration: Experiments and Extended Discussions

Tackling large approximate dynamic programming or reinforcement learning...

In Hindsight: A Smooth Reward for Steady Exploration

In classical Q-learning, the objective is to maximize the sum of discoun...

Average-Reward Off-Policy Policy Evaluation with Function Approximation

We consider off-policy policy evaluation with function approximation (FA...

A Unified Off-Policy Evaluation Approach for General Value Function

General Value Function (GVF) is a powerful tool to represent both the pr...

Incremental Ensemble Gaussian Processes

Belonging to the family of Bayesian nonparametrics, Gaussian process (GP...

Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error

In this work, we study the use of the Bellman equation as a surrogate ob...
