Bandits with Side Observations: Bounded vs. Logarithmic Regret

07/10/2018
by Rémy Degenne, et al.

We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency ϵ, an extra observation is gathered by the agent for free. We prove that, no matter how small ϵ is, the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with regret smaller than ∑_i log(1/ϵ)/Δ_i, up to a multiplicative constant and loglog terms. We also prove a matching lower bound, stating that no reasonable algorithm can outperform this quantity.
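For intuition, here is a minimal simulation sketch of the setting, assuming Bernoulli rewards and assuming the free observation comes from a uniformly random arm (the paper's model may differ in these details). The names simulate_side_observations, means, eps, and horizon are hypothetical, and the UCB1 learner is a generic baseline, not the algorithm constructed in the paper.

    import numpy as np

    def simulate_side_observations(means, eps, horizon, seed=0):
        # UCB1 learner in a bandit where, after each pull, a free
        # observation of a uniformly random arm arrives with prob. eps.
        # Illustrative sketch only: not the paper's algorithm.
        rng = np.random.default_rng(seed)
        k = len(means)
        counts = np.zeros(k)   # observations per arm (paid and free)
        sums = np.zeros(k)     # summed rewards observed per arm
        best = max(means)
        regret = 0.0
        for t in range(1, horizon + 1):
            if np.any(counts == 0):
                arm = int(np.argmin(counts))  # try each arm once first
            else:
                ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
                arm = int(np.argmax(ucb))
            counts[arm] += 1
            sums[arm] += rng.binomial(1, means[arm])
            regret += best - means[arm]       # pseudo-regret of this pull
            if rng.random() < eps:            # free side observation
                j = rng.integers(k)
                counts[j] += 1
                sums[j] += rng.binomial(1, means[j])
        return regret

    # Free observations feed the confidence bounds, so fewer paid
    # pulls of suboptimal arms are needed as eps grows:
    for eps in (0.0, 0.01, 0.1):
        print(eps, simulate_side_observations([0.5, 0.4], eps, 50_000))

Even this generic baseline benefits from the side observations; the point of the paper is stronger, namely that a dedicated algorithm turns any fixed ϵ > 0 into regret that is bounded uniformly in time.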


Related research

02/06/2013 · Bounded regret in stochastic multi-armed bandits
We study the stochastic multi-armed bandit problem when one knows the va...

02/14/2020 · Coordination without communication: optimal regret in two players multi-armed bandits
We consider two agents playing simultaneously the same stochastic three-...

08/10/2020 · Lenient Regret for Multi-Armed Bandits
We consider the Multi-Armed Bandit (MAB) problem, where the agent sequen...

02/17/2017 · Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits
Recent work on follow the perturbed leader (FTPL) algorithms for the adv...

07/06/2020 · Multi-Armed Bandits with Local Differential Privacy
This paper investigates the problem of regret minimization for multi-arm...

10/20/2018 · Quantifying the Burden of Exploration and the Unfairness of Free Riding
We consider the multi-armed bandit setting with a twist. Rather than hav...

10/15/2018 · Regret vs. Bandwidth Trade-off for Recommendation Systems
We consider recommendation systems that need to operate under wireless b...
