Regime Switching Bandits
We study a multi-armed bandit problem where the rewards exhibit regime switching. Specifically, the distributions of the random rewards generated by all arms depend on a common underlying state modeled as a finite-state Markov chain. The agent does not observe the underlying state and has to learn the unknown transition probability matrix as well as the reward distributions. We propose an efficient learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models and upper confidence bound methods for reinforcement learning. We also establish an O(T^{2/3} √(log T)) bound on the regret of the proposed learning algorithm, where T is the unknown time horizon. Finally, we conduct numerical experiments to illustrate the effectiveness of the learning algorithm.
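To make the setup concrete, below is a minimal simulation sketch of a regime-switching bandit: a hidden two-state Markov chain drives per-arm Bernoulli reward means, and a standard UCB1 baseline selects arms. This is an illustrative assumption-based sketch, not the paper's algorithm (which additionally estimates the HMM via spectral method-of-moments); the transition matrix `P`, reward means, and horizon are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment parameters (illustrative, not from the paper).
P = np.array([[0.95, 0.05],      # hidden-state transition probabilities
              [0.10, 0.90]])
means = np.array([[0.9, 0.2],    # Bernoulli reward means: rows = hidden state,
                  [0.1, 0.8]])   # columns = arm

n_arms = means.shape[1]
T = 10_000

state = 0                        # hidden regime, never observed by the agent
counts = np.zeros(n_arms)        # pulls per arm
totals = np.zeros(n_arms)        # cumulative reward per arm
reward_sum = 0.0

for t in range(1, T + 1):
    if t <= n_arms:
        arm = t - 1              # pull each arm once to initialize
    else:
        # UCB1 index as a simple baseline policy; it ignores the
        # hidden Markov structure that the paper's algorithm exploits.
        ucb = totals / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))

    reward = rng.random() < means[state, arm]   # Bernoulli reward draw
    counts[arm] += 1
    totals[arm] += reward
    reward_sum += reward

    # The hidden regime evolves regardless of the chosen action.
    state = rng.choice(2, p=P[state])

print(f"average reward: {reward_sum / T:.3f}")
```

Because the reward means flip across regimes, a policy that treats the means as stationary (like the UCB1 baseline above) can be systematically misled during long stays in one regime, which is the difficulty that motivates jointly learning the transition matrix and reward distributions.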