Restless Bandits with Many Arms: Beating the Central Limit Theorem

by Xiangyu Zhang, et al.

We consider finite-horizon restless bandits with multiple pulls per period, which play an important role in recommender systems, active learning, revenue management, and many other areas. While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms N. Thus, there is substantial value in understanding the performance of index policies and other policies that can be computed efficiently for large N. We study the growth of the optimality gap, i.e., the loss in expected performance compared to an optimal policy, for such policies in a classical asymptotic regime proposed by Whittle in which N grows while holding constant the fraction of arms that can be pulled per period. Intuition from the Central Limit Theorem and the tightest previous theoretical bounds suggest that this optimality gap should grow like O(√(N)). Surprisingly, we show that it is possible to outperform this bound. We characterize a non-degeneracy condition and a wide class of novel practically-computable policies, called fluid-priority policies, in which the optimality gap is O(1). These include most widely-used index policies. When this non-degeneracy condition does not hold, we show that fluid-priority policies nevertheless have an optimality gap that is O(√(N)), significantly generalizing the class of policies for which convergence rates are known. We demonstrate that fluid-priority policies offer state-of-the-art performance on a collection of restless bandit problems in numerical experiments.
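To make the setting concrete, here is a minimal sketch of one decision step under a generic priority rule with a fixed pull budget. It is an illustration only, not the paper's fluid-priority construction, which the abstract does not spell out. The function name `priority_pulls`, the integer state encoding, the priority order, and the tie-breaking by arm index are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def priority_pulls(states: Sequence[int],
                   priority: Sequence[int],
                   budget: int) -> List[int]:
    """Pick which arms to pull in one period (hypothetical helper).

    states   -- current state of each arm (one integer per arm)
    priority -- states listed from highest to lowest priority (assumed given)
    budget   -- number of arms that may be pulled this period
    """
    # Map each state to its rank; a lower rank means a higher priority.
    rank = {s: r for r, s in enumerate(priority)}
    # Order arms by the priority of their state, breaking ties by arm index.
    order = sorted(range(len(states)), key=lambda i: (rank[states[i]], i))
    # Pull the first `budget` arms in that order.
    return sorted(order[:budget])


# Whittle-style regime: the pulled fraction alpha stays fixed as N grows.
N, alpha = 6, 0.5
budget = math.floor(alpha * N)
arms = priority_pulls([0, 2, 1, 2, 0, 1], priority=[2, 1, 0], budget=budget)
print(arms)  # [1, 2, 3]
```

Here arms 1 and 3 are in the highest-priority state and are pulled first; the remaining pull goes to arm 2, the lowest-indexed arm in the next state. Scaling N while keeping `alpha` constant reproduces the asymptotic regime in which the abstract states its O(√(N)) and O(1) optimality-gap results.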




Near-optimality for infinite-horizon restless bandits with many arms

Restless bandits are an important class of problems with applications in...

Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits

We study the problem of planning restless multi-armed bandits (RMABs) wi...

Stochastic Bandits with Delay-Dependent Payoffs

Motivated by recommendation problems in music streaming platforms, we pr...

Sequential Decision Making under Uncertainty with Dynamic Resource Constraints

This paper studies a class of constrained restless multi-armed bandits. ...

Continuous-in-time Limit for Bayesian Bandits

This paper revisits the bandit problem in the Bayesian setting. The Baye...

Restless Bandits with Average Reward: Breaking the Uniform Global Attractor Assumption

We study the infinite-horizon restless bandit problem with the average r...

Convergence of Finite Memory Q-Learning for POMDPs and Near Optimality of Learned Policies under Filter Stability

In this paper, for POMDPs, we provide the convergence of a Q learning al...
