On Wasserstein Reinforcement Learning and the Fokker-Planck equation

12/19/2017
by Pierre H. Richemond, et al.

Policy gradient methods often achieve better performance when the change in policy is limited to a small Kullback-Leibler divergence. We derive policy gradients where the change in policy is instead limited to a small Wasserstein distance (a Wasserstein trust region). This is done in the discrete and continuous multi-armed bandit settings with entropy regularisation. We show that in the small-step limit with respect to the Wasserstein distance W_2, policy dynamics are governed by the Fokker-Planck (heat) equation, following the Jordan-Kinderlehrer-Otto result. This means that policies undergo diffusion and advection, concentrating near actions with high reward. This helps elucidate the nature of convergence in the probability matching setup, and provides justification for empirical practices such as Gaussian policy priors and additive gradient noise.
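
The diffusion-and-advection picture can be illustrated with a small particle simulation. The following is a minimal sketch and not the paper's method: it assumes a one-dimensional continuous bandit with a hypothetical quadratic reward r(a) = -(a - MU)^2 / 2 and an entropy temperature BETA, and it assumes the Fokker-Planck equation takes the standard entropy-regularised form ∂_t π = -∇·(π ∇r) + BETA Δπ. Under those assumptions the policy density can be sampled by Langevin dynamics, and its stationary distribution is the Gibbs density proportional to exp(r / BETA), here a Gaussian with mean MU and variance BETA. The names `simulate_policy_particles`, `MU` and `BETA` are illustrative only.

```python
import numpy as np


def simulate_policy_particles(reward_grad, beta, n_particles=20000,
                              n_steps=2000, dt=0.01, seed=0):
    """Euler-Maruyama simulation of dA = grad r(A) dt + sqrt(2 * beta) dW.

    Each particle is one sampled action; the empirical distribution of the
    particles approximates the policy density evolving under the assumed
    Fokker-Planck equation (advection along grad r, diffusion at rate beta).
    """
    rng = np.random.default_rng(seed)
    # Start from a broad initial policy, far from the stationary one.
    actions = rng.normal(0.0, 3.0, size=n_particles)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_particles)
        drift = reward_grad(actions) * dt          # advection towards high reward
        diffusion = np.sqrt(2.0 * beta * dt) * noise  # entropy-driven spreading
        actions = actions + drift + diffusion
    return actions


# Hypothetical quadratic reward r(a) = -(a - MU)**2 / 2, so grad r(a) = -(a - MU).
MU, BETA = 1.5, 0.25
samples = simulate_policy_particles(lambda a: -(a - MU), BETA)

print("empirical mean:", samples.mean())
print("empirical variance:", samples.var())
```

With these assumed settings the empirical mean should settle near MU and the empirical variance near BETA, up to sampling and time-discretisation error. That matches the abstract's description of a policy concentrating near high-reward actions while entropy regularisation keeps it from collapsing to a point.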


research
02/12/2018

A note on reinforcement learning with Wasserstein distance regularisation, with applications to multipolicy learning

In this note we describe an application of Wasserstein distance to Reinf...
research
10/12/2020

Efficient Wasserstein Natural Gradients for Reinforcement Learning

A novel optimization approach is proposed for application to policy grad...
research
09/05/2022

Natural Policy Gradients In Reinforcement Learning Explained

Traditional policy gradient methods are fundamentally flawed. Natural gr...
research
06/25/2023

Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

Trust-region methods based on Kullback-Leibler divergence are pervasivel...
research
08/09/2018

Policy Optimization as Wasserstein Gradient Flows

Policy optimization is a core component of reinforcement learning (RL), ...
research
12/29/2017

f-Divergence constrained policy improvement

To ensure stability of learning, state-of-the-art generalized policy ite...
research
06/29/2022

Discrete Langevin Sampler via Wasserstein Gradient Flow

Recently, a family of locally balanced (LB) samplers has demonstrated ex...
