Offline Policy Selection under Uncertainty

12/12/2020
by Mengjiao Yang, et al.

The presence of uncertainty in policy evaluation significantly complicates policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their values or on high-confidence intervals, access to the full distribution over one's belief of a policy's value enables more flexible selection algorithms under a wider range of downstream evaluation metrics. We propose BayesDICE for estimating this belief distribution in terms of posteriors of distribution correction ratios derived from stochastic constraints (as opposed to an explicit likelihood, which is not available). Empirically, BayesDICE is highly competitive with existing state-of-the-art approaches in confidence interval estimation. More importantly, we show how the belief distribution estimated by BayesDICE may be used to rank policies with respect to arbitrary downstream policy selection metrics, and we empirically demonstrate that this selection procedure significantly outperforms existing approaches, such as ranking policies according to mean or high-confidence lower-bound value estimates.
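
To make the last point concrete, the sketch below ranks a few candidate policies under different downstream selection metrics, assuming one already has Monte Carlo samples from each policy's value posterior (for instance, produced by a BayesDICE-style estimator). The function name `rank_policies` and the specific metrics shown are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rank_policies(value_samples, metric="mean", alpha=0.1):
    """Rank candidate policies given samples from a belief distribution
    over each policy's value (e.g., a BayesDICE-style posterior).

    value_samples: array of shape (num_policies, num_samples); row i holds
        Monte Carlo samples of policy i's value.
    metric: downstream selection criterion used to score policies.
    Returns policy indices ordered from best to worst under the metric.
    """
    values = np.asarray(value_samples, dtype=float)

    if metric == "mean":
        # Point-estimate ranking: posterior mean of each policy's value.
        scores = values.mean(axis=1)
    elif metric == "lcb":
        # High-confidence lower bound: the alpha-quantile of each posterior.
        scores = np.quantile(values, alpha, axis=1)
    elif metric == "prob_best":
        # Probability of being the best policy, estimated by how often each
        # policy attains the maximum across draws (treating the i-th draw of
        # every policy as one joint sample of the belief).
        best = values.argmax(axis=0)
        scores = np.bincount(best, minlength=values.shape[0]) / values.shape[1]
    else:
        raise ValueError(f"unknown metric: {metric}")

    return np.argsort(-scores)  # descending order of score

# Toy example: three candidate policies, 1000 posterior samples each.
rng = np.random.default_rng(0)
samples = np.stack([
    rng.normal(1.0, 0.05, 1000),  # solid value, low uncertainty
    rng.normal(1.1, 0.50, 1000),  # higher mean but very uncertain
    rng.normal(0.9, 0.02, 1000),  # low value, nearly certain
])
print(rank_policies(samples, metric="mean"))  # favors the high-mean policy
print(rank_policies(samples, metric="lcb"))   # penalizes the uncertain policy
print(rank_policies(samples, metric="prob_best"))
```

Even this toy example shows how the same belief distribution can yield different rankings depending on the downstream metric, which is why selecting policies from the full posterior is more flexible than committing to a single point estimate or lower bound.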

Related research

- CoinDICE: Off-Policy Confidence Interval Estimation (10/22/2020)
- Confidence-Conditioned Value Functions for Offline Reinforcement Learning (12/08/2022)
- Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting (06/18/2020)
- Offline Policy Comparison under Limited Historical Agent-Environment Interactions (06/07/2021)
- DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections (06/10/2019)
- Deeply-Debiased Off-Policy Interval Estimation (05/10/2021)
- Uncertainty in Ranking (07/07/2021)
