A Simple and Optimal Policy Design with Safety against Heavy-tailed Risk for Multi-armed Bandits

by David Simchi-Levi et al.

We design new policies that ensure both worst-case optimality for expected regret and light-tailed risk for the regret distribution in the stochastic multi-armed bandit problem. Recently, arXiv:2109.13595 showed that information-theoretically optimized bandit algorithms suffer from serious heavy-tailed risk: the worst-case probability of incurring a linear regret decays only slowly, at a polynomial rate of 1/T, as the time horizon T increases. Inspired by their results, we further show that widely used policies (e.g., Upper Confidence Bound and Thompson Sampling) also incur heavy-tailed risk, and that this heavy-tailed risk in fact exists for all "instance-dependent consistent" policies. To ensure safety against such heavy-tailed risk, starting from the two-armed bandit setting, we provide a simple policy design that (i) is worst-case optimal for the expected regret, at order Õ(√T), and (ii) has the worst-case tail probability of incurring a linear regret decaying at an optimal exponential rate exp(-Ω(√T)). We then extend the policy design and analysis to the general K-armed bandit setting and provide an explicit tail probability bound for any regret threshold under our policy. Specifically, the worst-case probability of incurring a regret larger than x is upper bounded by exp(-Ω(x/√(KT))). We also enhance the policy design to accommodate the "any-time" setting, where T is not known a priori, and prove performance guarantees equivalent to those in the "fixed-time" setting with known T. Numerical experiments are conducted to illustrate the theoretical findings. Our results reveal an incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality for expected regret and light-tailed risk for the regret distribution are compatible.
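To make the trade-off concrete, the following is a minimal simulation sketch (not the authors' actual policy design) of a two-armed Bernoulli bandit run under a simple explore-then-commit rule with roughly √T exploration pulls per arm. All names (`explore_then_commit`, `n_explore`, the arm means) are illustrative assumptions chosen for this example; the sketch only shows how pseudo-regret of such a policy can be estimated empirically, as in the kind of numerical experiments the abstract mentions.

```python
import math
import random

def explore_then_commit(means, T, n_explore):
    """Pull each of the two arms n_explore times, then commit to the
    empirically better arm for the remaining rounds.
    Returns the pseudo-regret (in expectation over rewards, given the pulls)."""
    best = max(means)
    regret = 0.0
    sums = [0.0, 0.0]  # empirical reward totals per arm
    for a in (0, 1):
        for _ in range(n_explore):
            sums[a] += 1.0 if random.random() < means[a] else 0.0
            regret += best - means[a]
    commit = 0 if sums[0] >= sums[1] else 1
    regret += (T - 2 * n_explore) * (best - means[commit])
    return regret

random.seed(0)
T = 10_000
n_explore = int(math.sqrt(T))     # ~sqrt(T) exploration pulls per arm
means = [0.5, 0.45]               # illustrative Bernoulli arm means
trials = [explore_then_commit(means, T, n_explore) for _ in range(200)]
avg = sum(trials) / len(trials)
print(f"average pseudo-regret over 200 runs: {avg:.1f}")
```

With a fixed exploration budget, a wrong commit costs linear regret (here at most (T − 2·n_explore)·0.05 ≈ 490), so the empirical distribution of `trials` is bimodal; examining its tail, rather than only `avg`, is exactly the distinction between expected regret and tail risk that the abstract studies.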




Related papers:
- Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk
- Minimax Policy for Heavy-tailed Multi-armed Bandits
- Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret
- Decision Variance in Online Learning
- Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits
- Thompson Sampling on Symmetric α-Stable Bandits
- Continuous Assortment Optimization with Logit Choice Probabilities under Incomplete Information
