On Bellman's principle of optimality and Reinforcement learning for safety-constrained Markov decision process

by Rahul Misra, et al.

We study optimality for safety-constrained Markov decision processes, the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process (with finite states and finite actions) in which the decision maker must reach a target set while avoiding an unsafe set with certain probabilistic guarantees. Because the target set and the unsafe set are, by definition, absorbing, the Markov chain induced by any control policy is multichain. The decision maker must also be optimal (with respect to a cost function) while navigating to the target set, which gives rise to a multi-objective optimization problem. We highlight the fact that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure, and we demonstrate this with a counterexample. We resolve the difficulty exposed by the counterexample by formulating the multi-objective optimization problem as a zero-sum game, and we then construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the reinforcement learning problem for this setting and construct a modified Q-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian, together with the corresponding error bounds.
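The zero-sum view of the constrained problem can be made concrete with a small numerical sketch. Below, a minimal primal-dual value-iteration loop runs on a hypothetical four-state model (the model, discount factor, step size, and the use of synchronous rather than asynchronous backups are all illustrative assumptions, not the paper's exact scheme): the minimizing player performs Bellman backups on the Lagrangian cost c + λd, while the maximizing player performs projected gradient ascent on the scalar multiplier λ.

```python
import numpy as np

# Hypothetical 4-state model (an illustrative assumption, not the
# paper's example): state 0 is the unsafe set and state 3 the target
# set, both absorbing, so every stationary policy induces a multichain.
# From start state 1, action 1 ("risky") reaches the target quickly but
# may hit the unsafe set; action 0 ("safe") detours through state 2.
nS, nA, gamma, s0 = 4, 2, 0.95, 1

P = np.zeros((nS, nA, nS))             # transition kernel P[s, a, s']
P[0, :, 0] = P[3, :, 3] = 1.0          # unsafe and target sets absorb
P[1, 0, 2] = 1.0                       # safe detour
P[1, 1, 3], P[1, 1, 0] = 0.7, 0.3      # risky shortcut
P[2, :, 3] = 1.0                       # the detour surely reaches the target

c = np.zeros((nS, nA)); c[1:3, :] = 1.0  # unit running cost until absorbed
d = P[:, :, 0].copy(); d[0, :] = 0.0     # safety cost: first entry into the unsafe set
budget = 0.1                             # tolerated discounted unsafe-reach value

def value_iteration(lam, sweeps=300):
    """Min player: synchronous Bellman backups on the Lagrangian c + lam*d."""
    V = np.zeros(nS)
    for _ in range(sweeps):
        V = (c + lam * d + gamma * P @ V).min(axis=1)
    return (c + lam * d + gamma * P @ V).argmin(axis=1)   # greedy policy

def policy_value(pi, cost):
    """Exact discounted value at the start state of a deterministic policy."""
    Ppi, cpi = P[np.arange(nS), pi], cost[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * Ppi, cpi)[s0]

lam, eta, best = 0.0, 1.0, None
for k in range(100):                     # max player: projected dual ascent
    pi = value_iteration(lam)
    Jd = policy_value(pi, d)
    if Jd <= budget:                     # track the best feasible policy seen
        Jc = policy_value(pi, c)
        if best is None or Jc < best[1]:
            best = (pi, Jc)
    lam = max(0.0, lam + eta * (Jd - budget))

print("multiplier:", round(lam, 3), "best feasible policy:", best[0])
```

With a constant dual step the multiplier oscillates near the value at which the greedy policy switches between the risky shortcut and the safe detour, which is why the loop above tracks the best feasible policy explicitly. For the learning setting, a tabular sketch of the modified Q-learning idea follows, reusing the toy model above: the temporal-difference target bootstraps on the combined cost c + λd, and the multiplier is updated between episodes from the observed discounted safety cost. The exploration rate, learning rate, episode structure, and dual-update schedule are again illustrative assumptions rather than the paper's algorithm.

```python
rng = np.random.default_rng(0)

def sample_step(s, a):
    """Sample a transition; the safety cost is the indicator of
    entering the unsafe set on this step."""
    s2 = int(rng.choice(nS, p=P[s, a]))
    return s2, c[s, a], float(s2 == 0)

def lagrangian_q_learning(episodes=3000, alpha=0.1, eps=0.1, eta=0.5):
    Q, lam = np.zeros((nS, nA)), 0.0
    for ep in range(episodes):
        s, disc_d, t = s0, 0.0, 0
        while s not in (0, 3):           # run until absorbed
            a = rng.integers(nA) if rng.random() < eps else int(Q[s].argmin())
            s2, cost, dcost = sample_step(s, a)
            # Modified Q-update: bootstrap on the Lagrangian cost.
            target = cost + lam * dcost + gamma * Q[s2].min()
            Q[s, a] += alpha * (target - Q[s, a])
            disc_d += gamma ** t * dcost
            s, t = s2, t + 1
        # Dual update between episodes, projected onto lam >= 0, with a
        # slowly decaying step so lam settles near its equilibrium.
        lam = max(0.0, lam + eta / np.sqrt(ep + 1) * (disc_d - budget))
    return Q, lam

Q, lam = lagrangian_q_learning()
print("greedy policy:", Q.argmin(axis=1), "learned multiplier:", round(lam, 3))
```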


