On the Global Convergence Rates of Softmax Policy Gradient Methods

05/13/2020 · by Jincheng Mei, et al.

We make three contributions toward better understanding policy gradient methods in the tabular setting. First, we show that with the true gradient, policy gradient with a softmax parametrization converges at an O(1/t) rate, with constants depending on the problem and the initialization. This result significantly strengthens recent asymptotic convergence results. The analysis relies on two findings: that the softmax policy gradient satisfies a Łojasiewicz inequality, and that the minimum probability of an optimal action during optimization can be bounded in terms of its initial value. Second, we analyze entropy-regularized policy gradient and show that it enjoys a significantly faster linear convergence rate of O(e^-t) toward the softmax optimal policy. This result resolves an open question in the recent literature. Finally, combining the above two results with additional new Ω(1/t) lower bounds, we explain how entropy regularization improves policy optimization, even with the true gradient, from the perspective of convergence rate. The separation of rates is further explained using the notion of non-uniform Łojasiewicz degree. These results provide a theoretical understanding of the impact of entropy and corroborate existing empirical studies.
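The abstract contrasts softmax policy gradient with access to the true gradient against its entropy-regularized variant. Below is a minimal sketch, not code from the paper, of both exact-gradient updates on a single-state bandit; the reward vector, temperature tau, step size eta, and iteration count are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not from the paper): exact-gradient softmax policy gradient
# on a one-state bandit with K actions, with and without entropy regularization.
np.random.seed(0)
K = 5
r = np.random.rand(K)      # assumed-known mean rewards ("true gradient" setting)
tau = 0.1                  # entropy-regularization temperature (assumption)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def run(entropy_reg, steps=5000, eta=0.2):
    theta = np.zeros(K)
    for _ in range(steps):
        pi = softmax(theta)
        # Unregularized objective: pi^T r; regularized: pi^T r + tau * H(pi).
        adv = r - tau * np.log(pi + 1e-12) if entropy_reg else r
        grad = pi * (adv - pi @ adv)       # exact gradient w.r.t. the softmax logits
        theta += eta * grad
    return softmax(theta)

pi_pg = run(entropy_reg=False)
pi_ent = run(entropy_reg=True)

pi_star_tau = np.exp(r / tau) / np.exp(r / tau).sum()   # softmax optimal policy
print("softmax PG sub-optimality gap:      %.3e" % (r.max() - pi_pg @ r))
print("entropy-reg. distance to pi*_tau:   %.3e" % np.abs(pi_ent - pi_star_tau).sum())
```

Under the paper's analysis, the unregularized run closes its sub-optimality gap at an O(1/t) rate, while the entropy-regularized run approaches the softmax optimal policy pi*_tau at a linear rate, at the cost of an O(tau) bias relative to the true optimum.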

Related research

06/08/2021 · Linear Convergence of Entropy-Regularized Natural Policy Gradient with Linear Function Approximation
01/16/2023 · The Role of Baselines in Policy Gradient Optimization
05/18/2020 · Entropy-Augmented Entropy-Regularized Reinforcement Learning and a Continuous Path from Policy Gradient to Q-Learning
02/22/2021 · Softmax Policy Gradient Methods Can Take Exponential Time to Converge
12/22/2021 · An Alternate Policy Gradient Estimator for Softmax Policies
05/13/2021 · Leveraging Non-uniformity in First-order Non-convex Optimization
10/29/2021 · Understanding the Effect of Stochasticity in Policy Optimization
