Grooming a Single Bandit Arm

06/11/2020
by Eren Ozbay, et al.

The stochastic multi-armed bandit problem captures the fundamental exploration vs. exploitation tradeoff inherent in online decision-making in uncertain settings. However, in several applications, the traditional objective of maximizing the expected sum of rewards obtained can be inappropriate. Motivated by the problem of optimizing job assignments to groom novice workers with unknown trainability in labor platforms, we consider a new objective in the classical setup. Instead of maximizing the expected total reward from T pulls, we consider the vector of cumulative rewards earned from each of the K arms at the end of T pulls, and aim to maximize the expected value of the highest cumulative reward. This corresponds to the objective of grooming a single, highly skilled worker using a limited supply of training jobs. For this new objective, we show that any policy must incur a regret of Ω(K^{1/3} T^{2/3}) in the worst case. We design an explore-then-commit policy featuring exploration based on finely tuned confidence bounds on the mean reward and an adaptive stopping criterion, which adapts to the problem difficulty and guarantees a regret of O(K^{1/3} T^{2/3} √(log K)) in the worst case. Our numerical experiments demonstrate that this policy improves upon several natural candidate policies for this setting.
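As a rough illustration of the explore-then-commit idea described in the abstract, the following is a minimal Python sketch, not the authors' finely tuned policy. It assumes a caller-supplied `pull(k)` returning a bounded reward in [0, 1] for arm k, explores the arms round-robin with simple Hoeffding-style confidence bounds and an early-stopping test, and then commits all remaining pulls to the empirically best arm so that a single arm accumulates as much reward as possible. The K^{1/3} T^{2/3} exploration cap mirrors the order of the regret bound; the constants and the specific confidence radius here are illustrative choices, not those of the paper.

```python
import numpy as np

def explore_then_commit_max_reward(pull, K, T, delta=0.05):
    """Illustrative sketch for the "groom one arm" objective: maximize the
    largest cumulative reward earned on any single arm after T pulls.

    pull(k) is assumed to return a stochastic reward in [0, 1] for arm k.
    This is a simplified explore-then-commit policy, not the paper's exact
    algorithm: round-robin exploration stops early once the empirically best
    arm's lower confidence bound exceeds every other arm's upper confidence
    bound, or once a K^{1/3} T^{2/3}-order exploration budget is spent.
    """
    counts = np.zeros(K)
    sums = np.zeros(K)  # cumulative reward per arm (the vector in the objective)
    budget = int(np.ceil(K ** (1 / 3) * T ** (2 / 3)))  # heuristic exploration cap
    t = 0

    def bounds():
        means = sums / np.maximum(counts, 1)
        rad = np.sqrt(np.log(2 * K * T / delta) / (2 * np.maximum(counts, 1)))
        return means - rad, means + rad

    # Round-robin exploration with an adaptive stopping test.
    while t < min(budget, T):
        arm = t % K
        sums[arm] += pull(arm)
        counts[arm] += 1
        t += 1
        if K > 1 and counts.min() >= 1:
            lcb, ucb = bounds()
            best = int(np.argmax(sums / counts))
            if lcb[best] >= np.max(np.delete(ucb, best)):
                break  # confident enough about the best arm: stop exploring

    # Commit every remaining pull to the empirically best arm.
    best = int(np.argmax(sums / np.maximum(counts, 1)))
    for _ in range(T - t):
        sums[best] += pull(best)

    return sums.max(), sums  # highest cumulative reward and the full vector
```

For example, with Bernoulli arms of (hypothetical) means mu, one would call explore_then_commit_max_reward(lambda k: np.random.binomial(1, mu[k]), K=len(mu), T=10000).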


Related research

01/22/2021: Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret
12/02/2021: Risk-Aware Algorithms for Combinatorial Semi-Bandits
06/28/2021: Dynamic Planning and Learning under Recovering Rewards
01/12/2016: Infomax strategies for an optimal balance between exploration and exploitation
08/19/2022: Mitigating Disparity while Maximizing Reward: Tight Anytime Guarantee for Improving Bandits
04/14/2023: Repeated Principal-Agent Games with Unobserved Agent Rewards and Perfect-Knowledge Agents
07/17/2018: Continuous Assortment Optimization with Logit Choice Probabilities under Incomplete Information
