Variance-Dependent Best Arm Identification
We study the problem of identifying the best arm in a stochastic multi-armed bandit. Given a set of n arms indexed from 1 to n, each arm i is associated with an unknown reward distribution supported on [0,1] with mean θ_i and variance σ_i^2. Assume θ_1 > θ_2 ≥ ⋯ ≥ θ_n. We propose an adaptive algorithm that explores the gaps and variances of the arms' rewards and makes future decisions based on the gathered information, using a novel approach called grouped median elimination. The proposed algorithm is guaranteed to output the best arm with probability at least 1-δ and uses at most O(∑_{i=1}^n (σ_i^2/Δ_i^2 + 1/Δ_i)(ln δ^{-1} + ln ln Δ_i^{-1})) samples, where Δ_i (i ≥ 2) denotes the reward gap between arm i and the best arm, and we define Δ_1 = Δ_2. This achieves a significant advantage over variance-independent algorithms in some favorable scenarios, and is the first result to remove the extra ln n factor on the best arm compared with the state of the art. We further show that Ω(∑_{i=1}^n (σ_i^2/Δ_i^2 + 1/Δ_i) ln δ^{-1}) samples are necessary for an algorithm to achieve the same goal, illustrating that our algorithm is optimal up to doubly logarithmic terms.
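To give a rough sense of when the variance-dependent bound helps, the sketch below evaluates the stated sample-complexity expression numerically and compares it with a classical variance-independent bound of the form ∑_i (1/Δ_i^2) ln δ^{-1}. The gap and variance values are made up for illustration only; the functions are hypothetical helpers, not part of the paper's algorithm.

```python
import math

def variance_dependent_bound(gaps, variances, delta):
    """Evaluate sum_i (sigma_i^2/Delta_i^2 + 1/Delta_i) * (ln(1/delta) + ln ln(1/Delta_i)).

    Gaps must be small enough that ln(1/Delta_i) > 1, so the inner
    double logarithm is well-defined and positive.
    """
    total = 0.0
    for d, v in zip(gaps, variances):
        total += (v / d**2 + 1.0 / d) * (math.log(1.0 / delta) + math.log(math.log(1.0 / d)))
    return total

def variance_independent_bound(gaps, delta):
    """A classical variance-independent benchmark: sum_i (1/Delta_i^2) * ln(1/delta)."""
    return sum(math.log(1.0 / delta) / d**2 for d in gaps)

# Illustrative instance: four suboptimal arms with gap 0.05 and low variance 0.01.
gaps = [0.05] * 4
variances = [0.01] * 4
delta = 0.05

vd = variance_dependent_bound(gaps, variances, delta)
vi = variance_independent_bound(gaps, delta)
print(f"variance-dependent: {vd:.1f}, variance-independent: {vi:.1f}")
```

When the per-arm variances σ_i^2 are much smaller than the worst case (here 0.01 versus the [0,1]-bounded maximum of 1/4), the variance-dependent expression is substantially smaller, which is the "favorable scenario" the abstract refers to.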