An Instrumental Variable Approach to Confounded Off-Policy Evaluation

12/29/2022
by   Yang Xu, et al.
0

Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/22/2022

Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process

This paper is concerned with constructing a confidence interval for a ta...
research
11/09/2020

Robust Batch Policy Learning in Markov Decision Processes

We study the sequential decision making problem in Markov decision proce...
research
12/29/2022

Quantile Off-Policy Evaluation via Deep Conditional Generative Learning

Off-Policy evaluation (OPE) is concerned with evaluating a new target po...
research
09/10/2021

Projected State-action Balancing Weights for Offline Reinforcement Learning

Offline policy evaluation (OPE) is considered a fundamental and challeng...
research
09/12/2022

Statistical Estimation of Confounded Linear MDPs: An Instrumental Variable Approach

In an Markov decision process (MDP), unobservable confounders may exist ...
research
04/05/2023

Conformal Off-Policy Evaluation in Markov Decision Processes

Reinforcement Learning aims at identifying and evaluating efficient cont...
research
09/12/2019

Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes

Off-policy evaluation (OPE) in reinforcement learning is notoriously dif...

Please sign up or login with your details

Forgot password? Click here to reset