Conformal Off-Policy Evaluation in Markov Decision Processes

04/05/2023
by Daniele Foffano et al.

Reinforcement Learning aims to identify and evaluate efficient control policies from data. In many real-world applications, the learner is not allowed to experiment and cannot gather data in an online manner (this is the case when experimenting is expensive, risky, or unethical). For such applications, the reward of a given policy (the target policy) must be estimated using historical data gathered under a different policy (the behavior policy). Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy or certainty guarantees. We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty. The main challenge in OPE stems from the distribution shift due to the discrepancies between the target and the behavior policies. We propose and empirically evaluate different ways to deal with this shift. Some of these methods yield conformalized intervals with reduced length compared to existing approaches, while maintaining the same certainty level.
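The paper's exact conformalization procedures are not reproduced here, but the general idea of correcting conformal intervals for distribution shift can be illustrated with weighted split conformal prediction, one standard technique for this setting. The sketch below is an illustrative assumption, not the authors' method: calibration rewards collected under the behavior policy are reweighted by (hypothetical) importance ratios pi_target / pi_behavior before taking the quantile of the nonconformity scores.

```python
import numpy as np

def weighted_conformal_interval(cal_rewards, cal_preds, weights, test_pred, alpha=0.1):
    """Simplified weighted split conformal interval (illustrative sketch).

    cal_rewards, cal_preds: observed rewards and point predictions on
        calibration data gathered under the behavior policy.
    weights: importance ratios pi_target / pi_behavior per calibration point
        (assumed known here; in practice they must be estimated).
    Returns an interval intended to contain the true reward with
    probability roughly 1 - alpha.
    """
    # Absolute residuals serve as nonconformity scores.
    scores = np.abs(np.asarray(cal_rewards, dtype=float)
                    - np.asarray(cal_preds, dtype=float))
    # Normalize the importance weights into a probability vector.
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()
    # Weighted (1 - alpha) quantile of the scores: sort scores, accumulate
    # weight mass, and take the first score whose cumulative mass reaches
    # the target level.
    order = np.argsort(scores)
    cum = np.cumsum(p[order])
    idx = min(int(np.searchsorted(cum, 1.0 - alpha)), len(scores) - 1)
    q = scores[order][idx]
    return test_pred - q, test_pred + q
```

With uniform weights this reduces to ordinary split conformal prediction; larger weights on calibration points that the target policy visits often widen or shrink the quantile accordingly. This sketch omits the test-point weight term used in the full weighted conformal construction, which can make intervals unbounded when the shift is severe.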

