Policy Optimization with Model-based Explorations

by   Feiyang Pan, et al.

Model-free reinforcement learning methods such as the Proximal Policy Optimization algorithm (PPO) have been successfully applied to complex decision-making problems such as Atari games. However, these methods suffer from high variance and high sample complexity. On the other hand, model-based reinforcement learning methods that learn the transition dynamics are more sample efficient, but they often suffer from bias in the transition estimation. How to make use of both model-based and model-free learning is a central problem in reinforcement learning. In this paper, we present a new technique to address the trade-off between exploration and exploitation, which regards the difference between the model-free and model-based estimations as a measure of exploration value. We apply this new technique to the PPO algorithm and arrive at a new policy optimization method, named Policy Optimization with Model-based Explorations (POME). POME uses two components to predict the actions' target values: a model-free one estimated by Monte-Carlo sampling, and a model-based one which learns a transition model and predicts the value of the next state. POME adds the error between these two target estimations as an additional exploration value for each state-action pair, i.e., it encourages the algorithm to explore the states with larger target errors, which are hard to estimate. We compare POME with PPO on Atari 2600 games, and the results show that POME outperforms PPO on 33 out of 49 games.
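The core idea described above can be sketched in a few lines: compute a model-free one-step target from the sampled trajectory, a model-based target by bootstrapping from a learned transition model's predicted next state, and use their discrepancy as an exploration bonus added to the target. This is a minimal illustrative sketch, not the paper's implementation; the function name, shapes, and the bonus weight `beta` are assumptions.

```python
import numpy as np

def pome_adjusted_targets(rewards, values, model_next_values,
                          gamma=0.99, beta=0.01):
    """Sketch of POME-style adjusted value targets.

    rewards[t]           : observed reward at step t (1-D array)
    values[t]            : critic's value estimate of the sampled state s_t
    model_next_values[t] : value of the next state as predicted through a
                           learned transition model (model-based estimate)
    """
    # Model-free one-step TD target, bootstrapping from the *sampled*
    # next state's value (terminal state bootstraps from 0).
    td_free = rewards + gamma * np.append(values[1:], 0.0)

    # Model-based target, bootstrapping from the *predicted* next state.
    td_model = rewards + gamma * model_next_values

    # Exploration bonus: discrepancy between the two target estimations.
    # Large disagreement marks state-action pairs that are hard to estimate.
    bonus = np.abs(td_free - td_model)

    # Encourage the policy toward state-actions with larger target errors.
    return td_free + beta * bonus
```

In the full algorithm, these adjusted targets would replace the ordinary TD targets inside PPO's advantage estimation, so the clipped surrogate objective itself is unchanged.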
