Deterministic Policy Optimization by Combining Pathwise and Score Function Estimators for Discrete Action Spaces

11/21/2017
by Daniel Lévy, et al.

Policy optimization methods have shown great promise in solving complex reinforcement and imitation learning tasks. While model-free methods are broadly applicable, they often require many samples to optimize complex policies. Model-based methods greatly improve sample efficiency, but at the cost of poor generalization: they require a carefully handcrafted model of the system dynamics for each task. Recently, hybrid methods have been successful in trading off applicability for improved sample complexity. However, these have been limited to continuous action spaces. In this work, we present a new hybrid method based on an approximation of the dynamics as an expectation over the next state under the current policy. This relaxation allows us to derive a novel hybrid policy gradient estimator, combining score function and pathwise derivative estimators, that is applicable to discrete action spaces. We show significant gains in sample complexity, ranging from 1.7× to 25×, when learning parameterized policies on Cart Pole, Acrobot, Mountain Car and Hand Mass. Our method is applicable to both discrete and continuous action spaces, whereas competing pathwise methods are limited to the latter.
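To make the two estimator families named in the abstract concrete, here is a minimal one-step sketch in plain Python. It is not the paper's estimator. It assumes a softmax policy over three discrete actions, a scalar state, and a known table of next states. All names (`score_function_grad`, `relaxed_pathwise_grad`, `next_states`, `reward_grad`) are hypothetical and chosen only for illustration. The first estimator samples actions and weights the gradient of the log-probability by the reward. The second differentiates a reward evaluated at the expected next state under the policy, which is one way to read the relaxation described above.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_grad(logits, rewards):
    """Closed-form gradient of E_{a~pi}[r(a)] with respect to the logits."""
    pi = softmax(logits)
    mean_reward = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - mean_reward) for p, r in zip(pi, rewards)]


def score_function_grad(logits, rewards, n_samples, rng):
    """Monte Carlo score-function (REINFORCE-style) estimate of the same gradient.

    Uses d/dz_j log pi(a) = 1[j == a] - pi_j for a softmax policy.
    """
    pi = softmax(logits)
    n = len(pi)
    grad = [0.0] * n
    for a in rng.choices(range(n), weights=pi, k=n_samples):
        for j in range(n):
            indicator = 1.0 if j == a else 0.0
            grad[j] += rewards[a] * (indicator - pi[j])
    return [g / n_samples for g in grad]


def relaxed_objective(logits, next_states, reward_fn):
    """Reward evaluated at the expected next state under the policy (toy relaxation)."""
    pi = softmax(logits)
    s_bar = sum(p * s for p, s in zip(pi, next_states))
    return reward_fn(s_bar)


def relaxed_pathwise_grad(logits, next_states, reward_grad):
    """Pathwise gradient of the relaxed objective via the chain rule.

    d s_bar / d z_j = pi_j * (s_j - s_bar) for a softmax policy.
    """
    pi = softmax(logits)
    s_bar = sum(p * s for p, s in zip(pi, next_states))
    return [reward_grad(s_bar) * p * (s - s_bar) for p, s in zip(pi, next_states)]
```

The score-function estimate is unbiased for the true objective but noisy, while the pathwise gradient of the relaxed objective is deterministic but differentiates a surrogate. For example, `score_function_grad([0.1, -0.3, 0.5], [1.0, 0.0, 2.0], 50000, random.Random(0))` should land close to `exact_grad([0.1, -0.3, 0.5], [1.0, 0.0, 2.0])`.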


research
09/15/2019

Policy Prediction Network: Model-Free Behavior Policy with Model-Based Learning in Continuous Action Space

This paper proposes a novel deep reinforcement learning architecture tha...
research
05/19/2017

Model-Based Planning in Discrete Action Spaces

Planning actions using learned and differentiable forward models of the ...
research
02/03/2023

Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies

Recently, the impressive empirical success of policy gradient (PG) metho...
research
06/12/2020

Combining Model-Based and Model-Free Methods for Nonlinear Control: A Provably Convergent Policy Gradient Approach

Model-free learning-based control methods have seen great success recent...
research
09/09/2015

Continuous control with deep reinforcement learning

We adapt the ideas underlying the success of Deep Q-Learning to the cont...
research
09/26/2019

V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

Some of the most successful applications of deep reinforcement learning ...
research
01/28/2022

On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces

We focus on parameterized policy search for reinforcement learning over ...
