Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs

05/31/2021
by Harsh Satija, et al.

We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning (RL) setting. We consider the scenario where (i) we have a dataset collected under a known baseline policy, and (ii) multiple reward signals are received from the environment, inducing as many objectives to optimize. We present an SPI formulation for this RL setting that takes into account the preferences of the algorithm's user for handling the trade-offs between the different reward signals, while ensuring that the new policy performs at least as well as the baseline policy along each individual objective. We build on traditional SPI algorithms and propose a novel method based on Safe Policy Improvement with Baseline Bootstrapping (SPIBB, Laroche et al., 2019) that provides high-probability guarantees on the performance of the agent in the true environment. We show the effectiveness of our method on a synthetic grid-world safety task as well as in a real-world critical care context, where we learn a policy for the administration of IV fluids and vasopressors to treat sepsis.
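
To make the selection criterion in the abstract concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes the per-objective returns of each candidate policy are already available as plain numbers, whereas the method described above works from an offline dataset and gives high-probability guarantees. The function names (`scalarize`, `is_safe_improvement`, `pick_policy`) and the linear weighting of objectives are hypothetical, chosen only for illustration.

```python
from typing import Dict, Sequence


def scalarize(returns: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-objective returns using the user's preference weights.

    Assumption: preferences are expressed as a linear weighting.
    """
    return sum(w * r for w, r in zip(weights, returns))


def is_safe_improvement(candidate: Sequence[float],
                        baseline: Sequence[float]) -> bool:
    """True if the candidate is at least as good as the baseline on EVERY objective."""
    return all(c >= b for c, b in zip(candidate, baseline))


def pick_policy(candidates: Dict[str, Sequence[float]],
                baseline: Sequence[float],
                weights: Sequence[float]) -> str:
    """Return the name of the best safe candidate, or "baseline" if none qualifies."""
    best_name = "baseline"
    best_score = scalarize(baseline, weights)
    for name, returns in candidates.items():
        if not is_safe_improvement(returns, baseline):
            continue  # violates the per-objective constraint, so it is rejected
        score = scalarize(returns, weights)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical numbers: two objectives, e.g. task reward and a safety signal.
baseline_returns = [1.0, 0.5]
candidate_returns = {"a": [1.2, 0.4], "b": [1.1, 0.6]}
user_weights = [0.7, 0.3]
chosen = pick_policy(candidate_returns, baseline_returns, user_weights)
```

In this toy setup, candidate `a` has the highest weighted score but falls below the baseline on the second objective, so it is rejected and `b` is chosen. This is the distinction the abstract draws between optimizing a user-weighted trade-off and guaranteeing no degradation on any single objective.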


research
02/14/2023

Constrained Decision Transformer for Offline Safe Reinforcement Learning

Safe reinforcement learning (RL) trains a constraint satisfaction policy...
research
07/11/2019

Safe Policy Improvement with Soft Baseline Bootstrapping

Batch Reinforcement Learning (Batch RL) consists in training a policy us...
research
10/12/2020

Remote Electrical Tilt Optimization via Safe Reinforcement Learning

Remote Electrical Tilt (RET) optimization is an efficient method for adj...
research
10/20/2022

Safe Policy Improvement in Constrained Markov Decision Processes

The automatic synthesis of a policy through reinforcement learning (RL) ...
research
11/02/2019

Thompson Sampling for Contextual Bandit Problems with Auxiliary Safety Constraints

Recent advances in contextual bandit optimization and reinforcement lear...
research
03/20/2019

Batch Policy Learning under Constraints

When learning policies for real-world domains, two important questions a...
research
01/28/2022

Towards Safe Reinforcement Learning with a Safety Editor Policy

We consider the safe reinforcement learning (RL) problem of maximizing u...
