Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

02/25/2022
by MohammadJavad Azizi, et al.

We study a sequential decision problem where the learner faces a sequence of K-armed stochastic bandit tasks. The tasks may be designed by an adversary, but the adversary is constrained to choose the optimal arm of each task from a smaller (but unknown) subset of M arms. The task boundaries may be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting), and the number of tasks N and the total number of rounds T are known (N could be unknown in the meta-learning setting). We design an algorithm based on a reduction to bandit submodular maximization, and show that its regret in both settings is smaller than the simple baseline of Õ(√(KNT)) that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length τ, we show that the regret of the algorithm is bounded as Õ(N√(Mτ) + N^(2/3)). Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved regret of Õ(N√(Mτ) + N^(1/2)).
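To see why these bounds beat the baseline, note that with fixed task length τ the total horizon is T = Nτ, so the baseline Õ(√(KNT)) reads Õ(N√(Kτ)); the bounds above thus replace the √K factor in the leading term with √M, which is a genuine improvement whenever M is much smaller than K.

Purely as an illustration of why a small optimal-arm set helps, the sketch below simulates the fixed-task-length meta-learning setting: an initial batch of tasks runs UCB1 over all K arms to collect candidate optimal arms, and later tasks restrict UCB1 to that small candidate set. This is a minimal demo under assumed Bernoulli instances, not the paper's bandit-submodular-maximization reduction; the warmup heuristic and all parameters (K, M, N, τ) are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, N, tau = 20, 3, 60, 500                    # arms, optimal-set size, tasks, task length
opt_set = rng.choice(K, size=M, replace=False)   # hidden set containing every task's best arm

def make_task():
    """Bernoulli task whose optimal arm is drawn from the hidden M-sized set."""
    means = rng.uniform(0.1, 0.5, size=K)
    means[rng.choice(opt_set)] = 0.9
    return means

def ucb_task(means, arms, horizon):
    """Standard UCB1 restricted to `arms`; returns task regret and the empirical best arm."""
    arms = np.asarray(arms)
    counts, sums, regret = np.zeros(len(arms)), np.zeros(len(arms)), 0.0
    for t in range(1, horizon + 1):
        if t <= len(arms):
            i = t - 1                             # play each available arm once
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            i = int(np.argmax(ucb))
        reward = float(rng.random() < means[arms[i]])
        counts[i] += 1
        sums[i] += reward
        regret += means.max() - means[arms[i]]    # regret against the best of all K arms
    return regret, arms[int(np.argmax(sums / counts))]

total, candidates = 0.0, set()
warmup = N // 5                                   # exploration phase: learn the candidate set
for n in range(N):
    means = make_task()
    if n < warmup:
        reg, best = ucb_task(means, np.arange(K), tau)     # explore all K arms
        candidates.add(int(best))
    else:
        reg, _ = ucb_task(means, sorted(candidates), tau)  # exploit the small candidate set
    total += reg
print(f"total regret over {N} tasks: {total:.1f} (candidate set size {len(candidates)})")
```

After the warmup phase the per-task problem effectively shrinks from K arms to roughly M, which mirrors the √K → √M improvement in the leading regret term; if the warmup misses one of the M arms, the affected later tasks incur linear regret, which is why the paper's analysis needs either the submodular-maximization reduction or the extra identifiability assumptions.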
