Be Aware of Non-Stationarity: Nearly Optimal Algorithms for Piecewise-Stationary Cascading Bandits

09/12/2019
by   Lingda Wang, et al.

Cascading bandit (CB) is a variant of both the multi-armed bandit (MAB) and the cascade model (CM), in which a learning agent aims to maximize the total reward by recommending K out of L items to a user. We focus on a common real-world scenario in which the user's preference changes in a piecewise-stationary manner. Two efficient algorithms, GLRT-CascadeUCB and GLRT-CascadeKL-UCB, are developed. The key idea behind the proposed algorithms is to incorporate an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT), within classical upper confidence bound (UCB) based algorithms. Gap-dependent regret upper bounds for the proposed algorithms are derived, and both match the lower bound Ω(√(T)) up to a poly-logarithmic factor in the number of time steps T. We also present numerical experiments on both synthetic and real-world datasets showing that GLRT-CascadeUCB and GLRT-CascadeKL-UCB outperform state-of-the-art algorithms in the literature.
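To make the key idea concrete, the following is a minimal sketch of a GLRT change-point detector for a stream of Bernoulli click feedback, the kind of detector the abstract describes plugging into a UCB-based algorithm. This is an illustrative reconstruction, not the authors' exact implementation: the scan over split points, the numerical-stability clipping, and the detection threshold are all assumptions made for the example.

```python
import numpy as np

def bern_kl(p, q, eps=1e-12):
    """Bernoulli KL divergence kl(p || q), clipped for numerical stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def glrt_change_detected(x, threshold):
    """Generalized Likelihood Ratio Test for a single change point.

    Given a Bernoulli observation stream x[0..n-1], scan every split
    point s and compare the best two-segment fit against the single-
    segment fit; flag a change when the statistic exceeds `threshold`
    (threshold choice here is a placeholder, not the paper's value).
    """
    n = len(x)
    if n < 2:
        return False
    total_mean = np.mean(x)
    glr = 0.0
    for s in range(1, n):
        mu1 = np.mean(x[:s])          # mean before the candidate split
        mu2 = np.mean(x[s:])          # mean after the candidate split
        stat = s * bern_kl(mu1, total_mean) + (n - s) * bern_kl(mu2, total_mean)
        glr = max(glr, stat)
    return glr > threshold
```

In a piecewise-stationary cascading bandit, a detector like this would monitor each item's observed click feedback, and a detection would trigger a reset of the UCB statistics so the agent can re-learn the new preference vector.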
