Bandits with many optimal arms
We consider a stochastic bandit problem with a possibly infinite number of arms. We write p^* for the proportion of optimal arms and Δ for the minimal mean-gap between optimal and sub-optimal arms. We characterize the optimal learning rates in both the cumulative regret setting and the best-arm identification setting, in terms of the problem parameters T (the budget), p^* and Δ. For the objective of minimizing the cumulative regret, we provide a lower bound of order Ω(log(T)/(p^*Δ)) and a UCB-style algorithm with a matching upper bound up to a factor of log(1/Δ). Our algorithm needs p^* to calibrate its parameters, and we prove that this knowledge is necessary, since adapting to p^* in this setting is impossible. For best-arm identification we also provide a lower bound of order Ω(exp(-cTΔ^2 p^*)) on the probability of outputting a sub-optimal arm, where c>0 is an absolute constant. We also provide an elimination algorithm with an upper bound matching the lower bound up to a factor of order log(1/Δ) in the exponential, and that does not need p^* or Δ as parameters.
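To give a concrete feel for why knowledge of p^* helps, here is a minimal illustrative sketch (not the paper's exact algorithm): subsample roughly log(T)/p^* arms from the reservoir, so that with high probability at least one optimal arm is included, then run standard UCB1 on the subsample. All function and parameter names below are illustrative assumptions.

```python
import math
import random

def subsample_ucb(T, p_star, draw_arm, pull, c=1.0):
    """Illustrative sketch: subsample K ~ c*log(T)/p_star arms, run UCB1.

    draw_arm() samples a fresh arm from the reservoir;
    pull(arm) returns a stochastic reward in [0, 1].
    Returns the arm with the highest empirical mean after T pulls.
    """
    # With K ~ log(T)/p_star uniform draws, the subsample contains an
    # optimal arm with probability roughly 1 - T^{-c}.
    K = max(1, math.ceil(c * math.log(T) / p_star))
    arms = [draw_arm() for _ in range(K)]
    counts = [0] * K
    sums = [0.0] * K
    t = 0
    for i in range(K):  # initialize: pull each subsampled arm once
        sums[i] += pull(arms[i])
        counts[i] = 1
        t += 1
    while t < T:
        # UCB1 index: empirical mean plus exploration bonus
        i = max(range(K),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2 * math.log(t) / counts[j]))
        sums[i] += pull(arms[i])
        counts[i] += 1
        t += 1
    best = max(range(K), key=lambda j: sums[j] / counts[j])
    return arms[best]

# Toy reservoir: an arm is its Bernoulli mean; optimal (0.9) with
# probability p_star, sub-optimal (0.9 - gap) otherwise.
random.seed(0)
p_star, gap = 0.5, 0.4
draw = lambda: 0.9 if random.random() < p_star else 0.9 - gap
reward = lambda m: 1.0 if random.random() < m else 0.0
chosen = subsample_ucb(T=2000, p_star=p_star, draw_arm=draw, pull=reward)
```

The sketch makes the calibration issue visible: K is set from p^*, and an over-estimate of p^* risks a subsample with no optimal arm, consistent with the paper's impossibility result for adapting to an unknown p^*.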