Estimating Optimal Policy Value in General Linear Contextual Bandits

by Jonathan N. Lee, et al.

In many bandit problems, the maximal reward achievable by a policy is often unknown in advance. We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable. We refer to this as V^* estimation. It was recently shown that fast V^* estimation is possible but only in disjoint linear bandits with Gaussian covariates. Whether this is possible for more realistic context distributions has remained an open and important question for tasks such as model selection. In this paper, we first provide lower bounds showing that this general problem is hard. However, under stronger assumptions, we give an algorithm and analysis proving that 𝒪(√(d)) sublinear estimation of V^* is indeed information-theoretically possible, where d is the dimension. We then present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on V^* that holds for general distributions and is tight when the context distribution is Gaussian. We prove our algorithm requires only 𝒪(√(d)) samples to estimate the upper bound. We use this upper bound and the estimator to obtain novel and improved guarantees for several applications in bandit model selection and testing for treatment effects.
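To make the target quantity concrete, here is a minimal Monte Carlo sketch of V^* = E_x[max_a ⟨θ_a, x⟩] for a disjoint linear bandit with Gaussian contexts, the setting the abstract references. This is purely illustrative and assumes the arm parameters θ_a are known; the paper's contribution is estimating V^* from sublinear bandit data when they are not. The function name and setup are hypothetical.

```python
import numpy as np

def monte_carlo_v_star(thetas, n_samples=100_000, rng=None):
    """Approximate V^* = E[max_a theta_a^T x] for contexts x ~ N(0, I_d).

    Illustrative only: assumes the arm parameters `thetas` (shape
    (n_arms, d)) are known, unlike the estimation problem in the paper.
    """
    rng = np.random.default_rng(rng)
    d = thetas.shape[1]
    x = rng.standard_normal((n_samples, d))  # Gaussian contexts
    rewards = x @ thetas.T                   # expected reward of each arm per context
    return rewards.max(axis=1).mean()        # optimal policy picks the best arm pointwise

# Two orthogonal unit-norm arms in d = 10 dimensions: the per-arm rewards are
# independent standard normals, so the exact value is E[max(Z1, Z2)] = 1/sqrt(pi) ~ 0.564.
thetas = np.zeros((2, 10))
thetas[0, 0] = 1.0
thetas[1, 1] = 1.0
print(monte_carlo_v_star(thetas, rng=0))
```

The plain Monte Carlo average above needs Ω(d) labeled interactions to learn the θ_a first; the paper's question is whether V^* itself can be pinned down with only 𝒪(√d) samples, before any near-optimal policy is learnable.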


