Optimizing over a Restricted Policy Class in Markov Decision Processes

by Ershad Banijamali, et al.

We address the problem of finding an optimal policy in a Markov decision process under a restricted policy class defined by the convex hull of a set of base policies. This problem is of great interest in applications in which a number of reasonably good (or safe) policies are already known and we are only interested in optimizing in their convex hull. We show that this problem is NP-hard to solve exactly as well as to approximate to arbitrary accuracy. However, under a condition that is akin to the occupancy measures of the base policies having large overlap, we show that there exists an efficient algorithm that finds a policy that is almost as good as the best convex combination of the base policies. The running time of the proposed algorithm is linear in the number of states and polynomial in the number of base policies. In practice, we demonstrate an efficient implementation for problems with large state spaces. Compared to traditional policy gradient methods, the proposed approach has the advantage that, apart from the computation of occupancy measures of some base policies, the iterative method need not interact with the environment during the optimization process. This is especially important in complex systems where estimating the value of a policy can be a time-consuming process.
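To make the problem statement concrete, the following is a minimal sketch of optimizing over the convex hull of two base policies in a toy MDP. It is not the algorithm proposed in the paper: it is a brute-force grid search over the mixing weight, and it evaluates every candidate policy exactly, which is the cost the paper's method aims to avoid. The MDP, its transition probabilities and rewards, the discounted objective, the state-wise mixing of action probabilities, and all function names are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical 2-state, 2-action MDP; every number here is an illustrative assumption.
# P[s][a][s2] is the probability of moving from state s to s2 under action a.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
# R[s][a] is the reward for taking action a in state s.
R = [
    [1.0, 0.0],
    [0.0, 2.0],
]
GAMMA = 0.9  # assumed discount factor

# Two base policies, pi[s][a] = probability of action a in state s.
BASE_POLICIES = [
    [[1.0, 0.0], [1.0, 0.0]],  # always take action 0
    [[0.0, 1.0], [0.0, 1.0]],  # always take action 1
]


def mix(weights: List[float], policies: List[List[List[float]]]) -> List[List[float]]:
    """Return the state-wise convex combination of the given policies."""
    n_states = len(policies[0])
    n_actions = len(policies[0][0])
    return [
        [sum(w * pi[s][a] for w, pi in zip(weights, policies)) for a in range(n_actions)]
        for s in range(n_states)
    ]


def evaluate(policy: List[List[float]], start: List[float], iters: int = 500) -> float:
    """Discounted value of a policy under a start distribution,
    computed by iterative policy evaluation."""
    n = len(P)
    v = [0.0] * n
    for _ in range(iters):
        v = [
            sum(
                policy[s][a]
                * (R[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in range(n)))
                for a in range(len(R[s]))
            )
            for s in range(n)
        ]
    return sum(start[s] * v[s] for s in range(n))


def best_mixture(steps: int = 100) -> Tuple[float, float]:
    """Grid-search the weight on the first base policy.
    Returns (best value, weight on the first base policy)."""
    start = [0.5, 0.5]
    return max(
        (evaluate(mix([k / steps, 1 - k / steps], BASE_POLICIES), start), k / steps)
        for k in range(steps + 1)
    )


if __name__ == "__main__":
    value, weight = best_mixture()
    print(f"best value {value:.3f} at weight {weight:.2f} on the first base policy")
```

With more than two base policies the grid grows exponentially in the number of policies, which is one informal way to see why a method with running time polynomial in the number of base policies is of interest.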




Solving POMDPs by Searching the Space of Finite Policies

Solving partially observable Markov decision processes (POMDPs) is highl...

On Sample Complexity of Projection-Free Primal-Dual Methods for Learning Mixture Policies in Markov Decision Processes

We study the problem of learning policy of an infinite-horizon, discount...

Large-Scale Markov Decision Problems via the Linear Programming Dual

We consider the problem of controlling a fully specified Markov decision...

Verification of Markov Decision Processes with Risk-Sensitive Measures

We develop a method for computing policies in Markov decision processes ...

A Boosting Approach to Reinforcement Learning

We study efficient algorithms for reinforcement learning in Markov decis...

Technical Report: The Policy Graph Improvement Algorithm

Optimizing a partially observable Markov decision process (POMDP) policy...

A virtual environment for formulation of policy packages

The interdependence and complexity of socio-technical systems and availa...
