Reproducible Bandits

by   Hossein Esfandiari, et al.

In this paper, we introduce the notion of reproducible policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called reproducible if it pulls, with high probability, the exact same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do reproducible policies exist, but also they achieve almost the same optimal (non-reproducible) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandits setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the reproducibility parameter is also optimal. Similarly, for stochastic linear bandits (with finitely and infinitely many arms) we develop reproducible policies that achieve the best-known problem-independent regret bounds with an optimal dependency on the reproducibility parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different rounds of executions.


page 1

page 2

page 3

page 4


Online Learning and Bandits with Queried Hints

We consider the classic online learning and stochastic multi-armed bandi...

Stochastic Bandits with Delay-Dependent Payoffs

Motivated by recommendation problems in music streaming platforms, we pr...

Batched Bandits with Crowd Externalities

In Batched Multi-Armed Bandits (BMAB), the policy is not allowed to be u...

Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits

We study the problem of planning restless multi-armed bandits (RMABs) wi...

Taking a hint: How to leverage loss predictors in contextual bandits?

We initiate the study of learning in contextual bandits with the help of...

Episodic Bandits with Stochastic Experts

We study a version of the contextual bandit problem where an agent is gi...

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

This work addresses the problem of regret minimization in non-stochastic...

Please sign up or login with your details

Forgot password? Click here to reset