A general sample complexity analysis of vanilla policy gradient

by Rui Yuan, et al.

Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems. However, a solid theoretical understanding of even the "vanilla" PG has remained elusive for a long time. In this paper, we apply recent tools developed for the analysis of SGD in non-convex optimization to obtain convergence guarantees for both REINFORCE and GPOMDP under a smoothness assumption on the objective function and weak conditions on the second moment of the norm of the estimated gradient. When instantiated under common assumptions on the policy space, our general result immediately recovers existing 𝒪(ϵ^{-4}) sample complexity guarantees, but for wider ranges of parameters (e.g., step size and batch size m) than in previous literature. Notably, our result includes the single-trajectory case (i.e., m=1), and it provides a more accurate analysis of the dependency on problem-specific parameters by fixing previous results available in the literature. We believe that integrating state-of-the-art tools from non-convex optimization may help identify a much broader range of problems where PG methods enjoy strong theoretical guarantees.
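To make the two estimators named in the abstract concrete, here is a minimal sketch of the standard single-trajectory REINFORCE and GPOMDP gradient estimates, plus the average over a batch of m trajectories. It is not taken from the paper: the function names, the discount factor `gamma`, and the representation of a trajectory as per-step score vectors (the gradients of log π(a_t | s_t) with respect to the policy parameters) and rewards are all assumptions made for illustration.

```python
def reinforce_estimate(scores, rewards, gamma):
    """REINFORCE: (sum of all score vectors) * (discounted return).

    scores[t]  : gradient of log pi(a_t | s_t), as a list of floats (assumed given)
    rewards[t] : reward received at step t
    """
    dim = len(scores[0])
    discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
    total_score = [sum(s[i] for s in scores) for i in range(dim)]
    return [discounted_return * x for x in total_score]


def gpomdp_estimate(scores, rewards, gamma):
    """GPOMDP: each discounted reward is weighted only by the scores of
    actions taken up to and including that step (rewards do not depend
    on future actions)."""
    dim = len(scores[0])
    grad = [0.0] * dim
    cumulative_score = [0.0] * dim
    for t, (s, r) in enumerate(zip(scores, rewards)):
        cumulative_score = [c + x for c, x in zip(cumulative_score, s)]
        weight = gamma**t * r
        grad = [g + weight * c for g, c in zip(grad, cumulative_score)]
    return grad


def batch_average(estimates):
    """Average m single-trajectory estimates; m = 1 returns the estimate itself."""
    m = len(estimates)
    return [sum(column) / m for column in zip(*estimates)]


# Toy two-step trajectory with hand-picked score vectors (hypothetical data).
scores = [[1.0, 0.0], [0.0, 1.0]]
rewards = [1.0, 2.0]
g_reinforce = reinforce_estimate(scores, rewards, gamma=0.5)
g_gpomdp = gpomdp_estimate(scores, rewards, gamma=0.5)
```

A vanilla PG step would then move the parameters along the averaged estimate, scaled by the step size. Both estimators target the same gradient of the truncated objective; GPOMDP drops the terms that pair a reward with the scores of later actions, which is the usual reason given for its lower variance.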


Sample Complexity of Policy Gradient Finding Second-Order Stationary Points

The goal of policy-based reinforcement learning (RL) is to search the ma...

A Nonparametric Offpolicy Policy Gradient

Reinforcement learning (RL) algorithms still suffer from high sample com...

An Empirical Analysis of Proximal Policy Optimization with Kronecker-factored Natural Gradients

In this technical report, we consider an approach that combines the PPO ...

Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval

In recent literature, a general two step procedure has been formulated f...

Global Convergence of Receding-Horizon Policy Search in Learning Estimator Designs

We introduce the receding-horizon policy gradient (RHPG) algorithm, the ...

On the Global Convergence of Momentum-based Policy Gradient

Policy gradient (PG) methods are popular and efficient for large-scale r...

Log Barriers for Safe Black-box Optimization with Application to Safe Reinforcement Learning

Optimizing noisy functions online, when evaluating the objective require...
