Trust-Region-Free Policy Optimization for Stochastic Policies

02/15/2023
by Mingfei Sun, et al.

Trust Region Policy Optimization (TRPO) is an iterative method that simultaneously maximizes a surrogate objective and enforces a trust region constraint over consecutive policies in each iteration. The combination of the surrogate objective maximization and the trust region enforcement has been shown to be crucial to guarantee a monotonic policy improvement. However, solving a trust-region-constrained optimization problem can be computationally intensive as it requires many steps of conjugate gradient and a large number of on-policy samples. In this paper, we show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee. The key idea is to generalize the surrogate objective used in TRPO in a way that a monotonic improvement guarantee still emerges as a result of constraining the maximum advantage-weighted ratio between policies. This new constraint outlines a conservative mechanism for iterative policy optimization and sheds light on practical ways to optimize the generalized surrogate objective. We show that the new constraint can be effectively enforced by being conservative when optimizing the generalized objective function in practice. We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree) as it is free of any explicit trust region constraints. Empirical results show that TREFree outperforms TRPO and Proximal Policy Optimization (PPO) in terms of policy performance and sample efficiency.
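
For readers unfamiliar with the baseline, the trust-region-constrained problem the abstract refers to is the standard TRPO update, sketched below in common notation (here π_θ is the candidate policy, π_k the current policy, A^{π_k} its advantage function, and δ the trust-region radius); the generalized surrogate objective and the exact maximum advantage-weighted ratio constraint used by TREFree are defined in the paper itself and are not reproduced here.

\[
\begin{aligned}
\max_{\theta}\quad & \mathbb{E}_{s,a \sim \pi_{k}}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{k}(a \mid s)}\, A^{\pi_{k}}(s,a)\right] \\
\text{s.t.}\quad & \mathbb{E}_{s \sim \pi_{k}}\!\left[ D_{\mathrm{KL}}\!\big(\pi_{k}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big) \right] \le \delta
\end{aligned}
\]

Solving this constrained problem is what typically requires many conjugate gradient steps and a line search per iteration; TREFree instead optimizes the generalized objective conservatively so that the ratio-based condition holds without any explicit trust region constraint.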

Related research

01/31/2022  Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO
We present a new monotonic improvement guarantee for optimizing decentra...

10/10/2017  On- and Off-Policy Monotonic Policy Improvement
Monotonic policy improvement and off-policy learning are two main desira...

12/03/2021  An Analytical Update Rule for General Policy Optimization
We present an analytical policy update rule that is independent of param...

01/31/2022  You May Not Need Ratio Clipping in PPO
Proximal Policy Optimization (PPO) methods learn a policy by iteratively...

10/15/2020  Solving Trust Region Subproblems Using Riemannian Optimization
The Trust Region Subproblem is a fundamental optimization problem that t...

09/06/2019  Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs
Trust region policy optimization (TRPO) is a popular and empirically suc...

07/31/2020  Queueing Network Controls via Deep Reinforcement Learning
Novel advanced policy gradient (APG) methods with conservative policy it...
