An Incremental Off-policy Search in a Model-free Markov Decision Process Using a Single Sample Path

by Ajin George Joseph, et al.

In this paper, we consider a modified version of the control problem in a model-free Markov decision process (MDP) setting with large state and action spaces. The control problem most commonly addressed in the contemporary literature is to find an optimal policy that maximizes the value function, i.e., the long-run discounted reward of the MDP. Current approaches also assume access to a generative model of the MDP, with the hidden premise that observations of the system behaviour, in the form of sample trajectories, can be obtained with ease from the model. In our modified version, the cost function is the expectation of a non-convex function of the value function, and no generative model is available. Instead, we assume that a sample trajectory, generated using an a priori chosen behaviour policy, is made available. In this restricted setting, we solve the modified control problem in its true sense, i.e., we find the best possible policy given this limited information. We propose a stochastic approximation algorithm, based on the well-known cross-entropy method, that is data (sample trajectory) efficient, stable and robust, as well as computationally and storage efficient. We provide a proof of convergence of our algorithm to a policy that is globally optimal relative to the behaviour policy. We also present experimental results that corroborate our claims and demonstrate the superiority of the solution produced by our algorithm over state-of-the-art algorithms under an appropriately chosen behaviour policy.
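For readers unfamiliar with the cross-entropy method mentioned in the abstract, the following is a minimal sketch of its generic batch form on a toy one-dimensional non-convex objective. It is not the paper's algorithm, which is a stochastic approximation variant driven by a single off-policy trajectory. The function names, the Gaussian sampling distribution, the toy objective, and all parameter values are illustrative assumptions.

```python
import math
import random
import statistics


def cross_entropy_maximize(objective, n_iters=50, n_samples=200,
                           elite_frac=0.1, smoothing=0.7, seed=0):
    """Generic batch cross-entropy method over a 1-D Gaussian (illustrative only)."""
    rng = random.Random(seed)
    mean, std = 0.0, 5.0                      # assumed initial sampling distribution
    n_elite = max(2, int(elite_frac * n_samples))
    for _ in range(n_iters):
        # 1. Sample candidates from the current Gaussian.
        samples = [rng.gauss(mean, std) for _ in range(n_samples)]
        # 2. Keep the top-scoring ("elite") fraction.
        elite = sorted(samples, key=objective)[-n_elite:]
        # 3. Move the Gaussian toward the elite samples (smoothed update).
        mean = smoothing * statistics.fmean(elite) + (1 - smoothing) * mean
        std = smoothing * statistics.pstdev(elite) + (1 - smoothing) * std
    return mean


def toy_objective(x):
    # Hypothetical non-convex objective: global maximum near x = 2,
    # smaller local bump near x = -2.
    return math.exp(-(x - 2.0) ** 2) + 0.5 * math.exp(-(x + 2.0) ** 2)


best = cross_entropy_maximize(toy_objective)
print(f"estimated maximizer: {best:.3f}")
```

In the setting described by the abstract, the sampled quantity would be a policy parameter and the objective would have to be estimated from the given trajectory rather than evaluated exactly, which is what makes an incremental, stochastic approximation form necessary.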




Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

In this paper we consider the problem of learning an ϵ-optimal policy fo...

On the Optimality of Sparse Model-Based Planning for Markov Decision Processes

This work considers the sample complexity of obtaining an ϵ-optimal poli...

Successive Over Relaxation Q-Learning

In a discounted reward Markov Decision Process (MDP) the objective is to...

Non-asymptotic Performances of Robust Markov Decision Processes

In this paper, we study the non-asymptotic performance of optimal policy...

Model-Free Learning and Optimal Policy Design in Multi-Agent MDPs Under Probabilistic Agent Dropout

This work studies a multi-agent Markov decision process (MDP) that can u...

Loop estimator for discounted values in Markov reward processes

At the working heart of policy iteration algorithms commonly used and st...

Provably Efficient Maximum Entropy Exploration

Suppose an agent is in a (possibly unknown) Markov decision process (MDP...
