Taking a hint: How to leverage loss predictors in contextual bandits?

by Chen-Yu Wei, et al.

We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret O(√(T)) for learning over T rounds when the total error of the predictor E ≤ T is relatively small. We provide a complete answer to this question, including upper and lower bounds for various settings: adversarial versus stochastic environments, known versus unknown E, and single versus multiple predictors. We show several surprising results, such as 1) the optimal regret is O(min{√(T), √(E)T^(1/4)}) when E is known, in sharp contrast to the standard and better bound O(√(E)) for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if E is unknown, but as a remedy, O(√(E)T^(1/3)) is achievable; 3) with M predictors, a linear dependence on M is necessary, even though logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key action remapping technique for optimal regret with known E, 2) implementing Catoni's robust mean estimator efficiently via an ERM oracle, leading to an efficient algorithm in the stochastic setting with optimal regret, 3) constructing an underestimator for E by estimating a histogram with bins of exponentially increasing size, for the stochastic setting with unknown E, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest.
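Among the techniques listed, Catoni's robust mean estimator is the most self-contained: it estimates a mean by solving for the root of a sum of bounded influence terms, which makes it far less sensitive to heavy-tailed noise than the empirical average. The paper implements it efficiently via an ERM oracle; the sketch below is only the plain scalar version (not the oracle-based one), with the confidence parameter `alpha` left as a user-chosen assumption, to illustrate the estimator itself.

```python
import numpy as np

def catoni_psi(x):
    # Catoni's influence function: log(1 + x + x^2/2) for x >= 0,
    # extended as an odd function for x < 0. It grows only
    # logarithmically, which caps the effect of outliers.
    return np.where(x >= 0.0,
                    np.log1p(x + 0.5 * x**2),
                    -np.log1p(-x + 0.5 * x**2))

def catoni_mean(samples, alpha, tol=1e-9):
    # Find theta solving sum_i psi(alpha * (x_i - theta)) = 0.
    # The sum is strictly decreasing in theta, so bisection works:
    # it is positive at theta below all samples and negative above.
    lo = float(samples.min()) - 1.0
    hi = float(samples.max()) + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if catoni_psi(alpha * (samples - mid)).sum() > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# On symmetric data the estimator recovers the center exactly;
# on data with a gross outlier it moves far less than the raw mean.
clean = np.array([0.0, 1.0, 2.0])
corrupted = np.array([0.0, 1.0, 2.0, 100.0])
print(catoni_mean(clean, alpha=1.0))      # close to 1.0
print(catoni_mean(corrupted, alpha=1.0))  # much closer to 1 than np.mean
```

The choice of `alpha` trades off bias against robustness (smaller `alpha` saturates the influence function sooner); in the paper's analysis it is set from the variance and confidence level, a calibration omitted here.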


