Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO

by Mingfei Sun et al.

We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which holds even when the transition dynamics are non-stationary. This analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, namely Independent Proximal Policy Optimization (IPPO) and Multi-Agent PPO (MAPPO), both of which rely on independent ratios, i.e., probability ratios computed separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises from enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be enforced in a principled way by bounding the independent ratios according to the number of agents in training, which provides a theoretical foundation for proximal ratio clipping. Moreover, we show that the surrogate objectives optimized in IPPO and MAPPO are essentially equivalent when their critics converge to a fixed point. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct consequence of enforcing such a trust region constraint via clipping in centralized training, and that the optimal values of the clipping hyperparameters are highly sensitive to the number of agents, as predicted by our theoretical analysis.
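To make the ratio-bounding idea concrete, below is a minimal NumPy sketch of a per-agent clipped surrogate whose clip range shrinks with the number of agents. The n-th-root scaling in per_agent_clip_range is a hypothetical illustration (chosen so that the product of n independently clipped ratios stays within the joint bounds); the principled bound is derived in the paper itself.

import numpy as np

def per_agent_clip_range(eps_joint, n_agents):
    # Hypothetical n-th-root scaling: if every independent ratio is kept
    # within these bounds, the product of n ratios (the joint ratio) stays
    # within [1 - eps_joint, 1 + eps_joint]. Illustrative only; the paper
    # derives the principled bound.
    lo = (1.0 - eps_joint) ** (1.0 / n_agents)
    hi = (1.0 + eps_joint) ** (1.0 / n_agents)
    return lo, hi

def clipped_surrogate(ratio, advantage, lo, hi):
    # Standard PPO pessimistic clipped objective, applied to one agent's
    # independent ratio pi_new(a_i | s) / pi_old(a_i | s).
    unclipped = ratio * advantage
    clipped = np.clip(ratio, lo, hi) * advantage
    return np.minimum(unclipped, clipped).mean()

# Toy usage: 4 agents sharing one centralized advantage estimate.
rng = np.random.default_rng(0)
n_agents, batch = 4, 256
ratios = rng.lognormal(mean=0.0, sigma=0.1, size=(n_agents, batch))
advantages = rng.standard_normal(batch)
lo, hi = per_agent_clip_range(0.2, n_agents)
objectives = [clipped_surrogate(ratios[i], advantages, lo, hi)
              for i in range(n_agents)]

Note how increasing n_agents pulls lo and hi toward 1, tightening the per-agent clip range; this mirrors the abstract's observation that good clipping hyperparameters are highly sensitive to the number of agents.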



