A short variational proof of equivalence between policy gradients and soft Q learning

12/22/2017
by Pierre H. Richemond, et al.

Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been shown to be equivalent when a softmax relaxation is applied on one side and an entropic regularization on the other. We relate this result to the well-known convex duality between Shannon entropy and the softmax (log-sum-exp) function, an instance of the Donsker-Varadhan formula, which yields a short proof of the equivalence. We then interpret this duality further and use ideas from convex analysis to prove a new policy inequality for soft Q-learning.
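For orientation, the duality the abstract refers to can be stated as follows in the finite-action case. This is a standard convex-analysis identity given here for illustration only; the symbols Q, pi, tau and H are generic notation, not taken verbatim from the paper.

\tau \log \sum_{a} e^{Q(s,a)/\tau}
  = \max_{\pi(\cdot \mid s)} \Big( \sum_{a} \pi(a \mid s)\, Q(s,a) + \tau\, H\big(\pi(\cdot \mid s)\big) \Big),
\qquad
H(\pi) = -\sum_{a} \pi(a) \log \pi(a),

with the maximum attained by the softmax (Boltzmann) policy \pi^*(a \mid s) \propto e^{Q(s,a)/\tau}. The Legendre-Fenchel conjugacy between the log-sum-exp function and negative Shannon entropy is exactly the finite-dimensional form of the Donsker-Varadhan variational formula mentioned above.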

