Maillard Sampling: Boltzmann Exploration Done Optimally

11/05/2021
by Jie Bian, et al.

The PhD thesis of Maillard (2013) presents a randomized algorithm for the K-armed bandit problem. This lesser-known algorithm, which we call Maillard sampling (MS), computes the probability of choosing each arm in closed form. This is useful for counterfactual evaluation from bandit-logged data, a property lacking in Thompson sampling, a widely adopted bandit algorithm in industry. Motivated by this merit, we revisit MS and perform an improved analysis showing that it achieves both asymptotic optimality and a √(KT log T) minimax regret bound, where T is the time horizon, matching the performance of the standard asymptotically optimal UCB. We then propose a variant of MS called MS^+ that improves the minimax bound to √(KT log K) without losing asymptotic optimality. MS^+ can also be tuned to be aggressive (i.e., explore less) without losing its theoretical guarantees, a unique feature unavailable in existing bandit algorithms. Our numerical evaluation shows the effectiveness of MS^+.
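The closed-form arm probabilities mentioned above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: it assumes 1-sub-Gaussian rewards and the Boltzmann-style rule in which each arm's probability decays exponentially in its pull count times its squared empirical gap; the function name `ms_probs` is ours.

```python
import numpy as np

def ms_probs(counts, means):
    """Illustrative sketch of Maillard sampling (MS) arm probabilities.

    Assumption (1-sub-Gaussian rewards): p(a) is proportional to
    exp(-N_a * gap_a^2 / 2), where N_a is the pull count of arm a and
    gap_a = max_a' mean_a' - mean_a is the empirical suboptimality gap.
    Because the probabilities are available in closed form, bandit-logged
    data can later be reweighted for counterfactual evaluation.
    """
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means              # empirical gaps; best arm has gap 0
    logits = -0.5 * counts * gaps ** 2      # log of the unnormalized probability
    logits -= logits.max()                  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()                      # normalize to a distribution
```

Note that the best empirical arm always has gap 0 and thus the largest unnormalized weight, while heavily pulled, clearly suboptimal arms are sampled with exponentially small probability.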
