The Local Optimality of Reinforcement Learning by Value Gradients, and its Relationship to Policy Gradient Learning

01/02/2011
by Michael Fairbank, et al.

In this theoretical paper we are concerned with the problem of learning a value function with a smooth general function approximator, in order to solve a deterministic episodic control problem in a large continuous state space. It is shown that learning the gradient of the value function at every point along a trajectory generated by a greedy policy is a sufficient condition for the trajectory to be locally extremal, and often locally optimal, and we argue that this brings greater efficiency to value-function learning. This contrasts with traditional value-function learning, in which the value function must be learnt over the whole of state space. It is also proven that policy-gradient learning applied to a greedy policy on a value function produces a weight update equivalent to a value-gradient weight update. This provides a surprising connection between these two alternative paradigms of reinforcement learning, as well as a convergence proof for control problems with a value function represented by a general smooth function approximator.
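To make the terms "greedy policy on a value function" and "value gradient" concrete, the following is a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: the scalar dynamics `step`, the quadratic `reward`, the one-weight approximator `value`, and the helper names. The greedy action has a closed form only because this toy problem is quadratic.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): scalar state x, scalar action a.

def step(x, a):
    """Deterministic dynamics: next state."""
    return x + a

def reward(x, a):
    """Quadratic penalty on state and action."""
    return -(x ** 2) - a ** 2

def value(x, w):
    """Smooth approximate value function with a single weight w > 0."""
    return -w * x ** 2

def value_gradient(x, w):
    """Gradient of the approximate value function with respect to the state."""
    return -2.0 * w * x

def greedy_action(x, w):
    """Action maximising reward(x, a) + value(step(x, a), w).

    Setting the derivative -2a - 2w(x + a) to zero gives a = -w x / (1 + w).
    """
    return -w * x / (1.0 + w)

def rollout(x0, w, horizon):
    """Follow the greedy policy for `horizon` steps from x0."""
    xs, actions = [x0], []
    x = x0
    for _ in range(horizon):
        a = greedy_action(x, w)
        actions.append(a)
        x = step(x, a)
        xs.append(x)
    return np.array(xs), np.array(actions)

if __name__ == "__main__":
    xs, actions = rollout(x0=1.5, w=0.7, horizon=5)
    # Value gradients evaluated only at the states the greedy trajectory visits.
    print(value_gradient(xs, 0.7))
```

In this sketch the value gradient is evaluated only at the states visited by the greedy trajectory, which is the set of points the abstract says it suffices to learn on. The sketch does not implement the learning update or the equivalence proof themselves.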


