A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

06/29/2022
by   Saeed Masoudian, et al.

We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback, which, in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin, simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i}\log(T) + \frac{d}{\Delta_i \log K}\right) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which relies on oracle knowledge of the maximal delay $d_{\max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{\max}K^{1/3}\log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}\left(\frac{1}{\Delta_i}\log(T) + \frac{\sigma_{\max}}{\Delta_i \log K}\right) + d_{\max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{\max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches the regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.
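To get a feel for the fixed-delay adversarial guarantee, the two terms of the bound can be evaluated numerically. The sketch below is only an illustration: the function names `adversarial_bound` and `delay_term_dominates` are hypothetical, the constants hidden by the big-O notation are set to 1, and the natural logarithm is assumed. It is not the algorithm from the paper, only arithmetic on the stated rate.

```python
import math


def adversarial_bound(T: int, K: int, d: int) -> float:
    """Evaluate sqrt(T*K) + sqrt(d*T*log K) with all hidden constants set to 1.

    T: time horizon, K: number of arms (K >= 2), d: fixed delay (d >= 0).
    This is the order of the bound only, not an actual regret value.
    """
    no_delay_term = math.sqrt(T * K)
    delay_term = math.sqrt(d * T * math.log(K))
    return no_delay_term + delay_term


def delay_term_dominates(K: int, d: int) -> bool:
    """True when sqrt(d*T*log K) >= sqrt(T*K), i.e. when d*log K >= K.

    T cancels from both sides, so the comparison does not depend on the horizon.
    """
    return d * math.log(K) >= K


if __name__ == "__main__":
    # Illustrative values chosen for this example, not taken from the paper.
    for d in (0, 4, 5, 50):
        print(d, round(adversarial_bound(10_000, 10, d), 1), delay_term_dominates(10, d))
```

Under these assumptions the delay term overtakes the delay-free $\sqrt{TK}$ term once $d \log K \ge K$, independently of $T$.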

research
10/14/2019

An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays

We propose a new algorithm for adversarial multi-armed bandits with unre...
research
02/19/2021

An Algorithm for Stochastic and Adversarial Bandits with Switching Costs

We propose an algorithm for stochastic and adversarial multiarmed bandit...
research
08/21/2023

An Improved Best-of-both-worlds Algorithm for Bandits with Delayed Feedback

We propose a new best-of-both-worlds algorithm for bandits with variably...
research
07/02/2018

Adaptation to Easy Data in Prediction with Limited Advice

We derive an online learning algorithm with improved regret guarantees f...
research
07/12/2022

Simultaneously Learning Stochastic and Adversarial Bandits under the Position-Based Model

Online learning to rank (OLTR) interactively learns to choose lists of i...
research
06/03/2019

Nonstochastic Multiarmed Bandits with Unrestricted Delays

We investigate multiarmed bandits with delayed feedback, where the delay...
research
05/30/2023

Delayed Bandits: When Do Intermediate Observations Help?

We study a K-armed bandit with delayed feedback and intermediate observa...
