A general sample complexity analysis of vanilla policy gradient

07/23/2021
by Rui Yuan, et al.

The policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems. However, a solid theoretical understanding of even the "vanilla" PG has remained elusive for a long time. In this paper, we apply recent tools developed for the analysis of SGD in non-convex optimization to obtain convergence guarantees for both REINFORCE and GPOMDP under a smoothness assumption on the objective function and weak conditions on the second moment of the norm of the estimated gradient. When instantiated under common assumptions on the policy space, our general result immediately recovers existing 𝒪(ϵ^-4) sample complexity guarantees, but for wider ranges of parameters (e.g., step size and batch size m) than in the previous literature. Notably, our result includes the single-trajectory case (i.e., m=1) and provides a more accurate analysis of the dependency on problem-specific parameters, correcting previous results available in the literature. We believe that the integration of state-of-the-art tools from non-convex optimization may help identify a much broader range of problems where PG methods enjoy strong theoretical guarantees.
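The paper itself provides no code; for intuition, here is a minimal sketch of the vanilla REINFORCE estimator the abstract refers to, on a hypothetical one-step two-armed bandit (the toy environment, mean rewards, and all hyperparameters below are illustrative assumptions, not taken from the paper). It also illustrates the single-trajectory case m=1 that the analysis covers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy environment: a one-step "bandit" with two actions
# whose mean rewards are 0.2 and 1.0 (illustrative values only).
MEAN_REWARDS = np.array([0.2, 1.0])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sample_trajectory(theta):
    """Sample one single-step trajectory under the softmax policy."""
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = MEAN_REWARDS[a] + 0.1 * rng.standard_normal()  # noisy reward
    return a, r

def reinforce_grad(theta, m):
    """REINFORCE gradient estimate from a batch of m trajectories:
    grad J(theta) ~ (1/m) * sum_i R(tau_i) * grad log pi_theta(a_i)."""
    g = np.zeros_like(theta)
    for _ in range(m):
        a, r = sample_trajectory(theta)
        probs = softmax(theta)
        grad_log = -probs
        grad_log[a] += 1.0  # gradient of log softmax w.r.t. theta
        g += r * grad_log
    return g / m

def train(steps=2000, step_size=0.1, m=1):
    """Stochastic gradient *ascent* on J(theta); m=1 is the
    single-trajectory regime covered by the paper's analysis."""
    theta = np.zeros(2)
    for _ in range(steps):
        theta += step_size * reinforce_grad(theta, m)
    return theta
```

After training with m=1, the softmax policy concentrates on the higher-reward action, even though each update uses a single noisy trajectory.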

