On learning Whittle index policy for restless bandits with scalable regret

02/07/2022 ∙ by Nima Akbarzadeh, et al.

Reinforcement learning (RL) is an attractive approach for learning good resource allocation and scheduling policies from data when the system model is unknown. However, the cumulative regret of most RL algorithms scales as Õ(𝖲√(𝖠T)), where 𝖲 is the size of the state space, 𝖠 is the size of the action space, T is the horizon, and the Õ(·) notation hides logarithmic terms. Due to the linear dependence on the size of the state space, these regret bounds are prohibitively large for resource allocation and scheduling problems. In this paper, we present a model-based RL algorithm for such problems which has scalable regret. In particular, we consider a restless bandit model and propose a Thompson-sampling-based learning algorithm which is tuned to the underlying structure of the model. We present two characterizations of the regret of the proposed algorithm with respect to the Whittle index policy. First, we show that for a restless bandit with n arms and at most m activations at each time, the regret scales either as Õ(mn√(T)) or Õ(n^2 √(T)), depending on the reward model. Second, under an additional technical assumption, we show that the regret scales as Õ(n^1.5 √(T)). We present numerical examples to illustrate the salient features of the algorithm.
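To give a feel for why the linear dependence on the state space matters, the following is a rough numerical sketch, not taken from the paper. It assumes each of the n arms has k local states (k is a hypothetical parameter introduced here), so that the joint state space has size k^n, and that exactly m arms are activated, so that the joint action space has size C(n, m). Constants and logarithmic factors are ignored, and the function names are illustrative only.

```python
import math


def generic_rl_bound(n, k, m, T):
    """Order of S * sqrt(A * T) for the joint problem.

    Assumptions (not from the source): each arm has k local states,
    so S = k**n, and exactly m of n arms are activated, so A = C(n, m).
    """
    S = k ** n
    A = math.comb(n, m)
    return S * math.sqrt(A * T)


def restless_bandit_bounds(n, m, T):
    """Orders of the three bounds quoted in the abstract (no constants, no logs)."""
    root_T = math.sqrt(T)
    return {
        "m*n*sqrt(T)": m * n * root_T,
        "n^2*sqrt(T)": n ** 2 * root_T,
        "n^1.5*sqrt(T)": n ** 1.5 * root_T,
    }


if __name__ == "__main__":
    n, k, m, T = 10, 2, 3, 10_000
    print("generic:", generic_rl_bound(n, k, m, T))
    for name, value in restless_bandit_bounds(n, m, T).items():
        print(name, value)
```

Under these assumptions the generic term grows exponentially in n, while the three restless-bandit terms grow polynomially in n, which is the sense in which the abstract calls the regret scalable.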


research ∙ 08/18/2021

Scalable regret for learning to control network-coupled subsystems with unknown dynamics

We consider the problem of controlling an unknown linear quadratic Gauss...
research ∙ 02/26/2020

Memory-Constrained No-Regret Learning in Adversarial Bandits

An adversarial bandit problem with memory constraints is studied where o...
research ∙ 09/12/2012

Regret Bounds for Restless Markov Bandits

We consider the restless Markov bandit problem, in which the state of ea...
research ∙ 04/29/2020

Whittle index based Q-learning for restless bandits with average reward

A novel reinforcement learning algorithm is introduced for multiarmed re...
research ∙ 03/03/2022

The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches

In this paper, we study the problem of regret minimization for episodic ...
research ∙ 12/07/2019

No-Regret Exploration in Goal-Oriented Reinforcement Learning

Many popular reinforcement learning problems (e.g., navigation in a maze...
research ∙ 08/19/2021

A relaxed technical assumption for posterior sampling-based reinforcement learning for control of unknown linear systems

We revisit the Thompson sampling algorithm to control an unknown linear ...
