Reward Selection with Noisy Observations
We study a fundamental problem in optimization under uncertainty. There are n boxes; each box i contains a hidden reward x_i. Rewards are drawn i.i.d. from an unknown distribution D. For each box i, we see y_i, an unbiased estimate of its reward, drawn from a Normal distribution with known standard deviation σ_i (and unknown mean x_i). Our task is to select a single box, with the goal of maximizing our reward. This problem captures a wide range of applications, e.g., ad auctions, where the hidden reward is the click-through rate of an ad. Previous work in this model [BKMR12] proves that the naive policy, which selects the box with the largest estimate y_i, is suboptimal, and suggests a linear policy, which selects the box i with the largest y_i - c · σ_i, for some c > 0. However, no formal guarantees are given about the performance of either policy (e.g., whether their expected reward is within some factor of the optimal policy's reward). In this work, we prove that both the naive policy and the linear policy are arbitrarily bad compared to the optimal policy, even when D is well-behaved, e.g., has monotone hazard rate (MHR), and even under a "small tail" condition, which requires that not too many boxes have arbitrarily large noise. On the flip side, we propose a simple threshold policy that gives a constant approximation to the reward of a prophet (who knows the realized values x_1, …, x_n) under the same "small tail" condition. We prove that when this condition is not satisfied, even an optimal clairvoyant policy (that knows D) cannot get a constant approximation to the prophet, even for MHR distributions, implying that our threshold policy is optimal against the prophet benchmark, up to constants.
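To make the setup concrete, here is a minimal simulation sketch (not from the paper) of the naive policy, the linear policy, and the prophet benchmark. The exponential reward distribution, the noise scales, and the constant c are illustrative assumptions chosen only to demonstrate the selection rules described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=50, c=1.0, trials=20000):
    """Compare the naive and linear policies against the prophet benchmark.

    Hidden rewards x_i are drawn i.i.d. from an exponential distribution
    (an MHR example); each estimate y_i is x_i plus Normal noise with a
    known, box-specific standard deviation sigma_i. All parameter choices
    here are hypothetical, for illustration only.
    """
    naive_total, linear_total, prophet_total = 0.0, 0.0, 0.0
    for _ in range(trials):
        x = rng.exponential(scale=1.0, size=n)       # hidden rewards x_i
        sigma = rng.uniform(0.1, 3.0, size=n)        # known noise levels sigma_i
        y = x + rng.normal(0.0, sigma)               # unbiased noisy estimates y_i

        naive_total += x[np.argmax(y)]               # naive: largest y_i
        linear_total += x[np.argmax(y - c * sigma)]  # linear: largest y_i - c * sigma_i
        prophet_total += x.max()                     # prophet: knows realized x_i

    return naive_total / trials, linear_total / trials, prophet_total / trials

if __name__ == "__main__":
    naive, linear, prophet = simulate()
    print(f"naive policy:  {naive:.3f}")
    print(f"linear policy: {linear:.3f}")
    print(f"prophet:       {prophet:.3f}")
```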