BanditQ: Fair Multi-Armed Bandits with Guaranteed Rewards per Arm

04/11/2023
by Abhishek Sinha, et al.

Classic no-regret online prediction algorithms, including variants of the Upper Confidence Bound (UCB) algorithm, Hedge, and EXP3, are inherently unfair by design. Their unfairness stems from their objective of playing the most rewarding of the N arms as often as possible while ignoring the rest. In this paper, we consider a fair prediction problem in the stochastic setting with hard lower bounds on the rate of accrual of rewards for a set of arms. We study the problem in both full-information and bandit feedback settings. Using queueing-theoretic techniques in conjunction with adversarial learning, we propose a new online prediction policy, called BanditQ, that achieves the target reward rates while conceding a regret and target-rate-violation penalty of at most O(T^{3/4}). In the full-information setting, the regret bound can be further improved to O(√T) when the average regret over the entire horizon of length T is considered. The proposed policy is efficient and admits a black-box reduction from the fair prediction problem to the standard MAB problem with a carefully defined sequence of rewards. The design and analysis of the policy involve a novel use of the potential-function method in conjunction with scale-free second-order regret bounds and a new self-bounding inequality for the reward gradients, which may be of independent interest.
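The queueing-plus-adversarial-learning template behind the black-box reduction can be made concrete with a small sketch. The following is a minimal illustration and not the paper's exact BanditQ policy: each arm carries a virtual queue tracking its deficit against a target rate λ_i, and a standard adversarial MAB learner (EXP3 here) is run on queue-weighted surrogate rewards. The surrogate form (V + Q_i)·r, the trade-off parameter V, the learning and exploration rates, and the estimate rescaling are all assumptions chosen for illustration.

```python
# Illustrative sketch (NOT the authors' exact BanditQ policy) of the
# queueing + adversarial-learning reduction: per-arm virtual queues track
# deficits against target reward rates, and EXP3 is fed queue-weighted
# surrogate rewards. V, eta, gamma, and the surrogate form are assumptions.
import numpy as np

rng = np.random.default_rng(0)

N = 3                                   # number of arms
T = 10_000                              # horizon
mu = np.array([0.9, 0.5, 0.4])          # (unknown) Bernoulli mean rewards
lam = np.array([0.0, 0.15, 0.10])       # target reward rates lambda_i

V = 50.0                                # trade-off: reward vs. rate violation
eta = np.sqrt(np.log(N) / (N * T))      # EXP3 learning rate
gamma = min(1.0, np.sqrt(N * np.log(N) / T))  # uniform exploration rate
weights = np.ones(N)                    # EXP3 weights
Q = np.zeros(N)                         # virtual queues (deficit counters)
earned = np.zeros(N)

for t in range(T):
    # EXP3 sampling distribution with uniform exploration mixed in.
    p = (1 - gamma) * weights / weights.sum() + gamma / N
    arm = rng.choice(N, p=p)

    # Bandit feedback: observe the reward of the played arm only.
    r = float(rng.random() < mu[arm])
    earned[arm] += r

    # Surrogate reward: arms with large deficits get boosted rewards.
    surrogate = (V + Q[arm]) * r
    # Importance-weighted estimate, rescaled so it stays in [0, 1/p].
    est = surrogate / (p[arm] * (V + Q.max()))
    weights[arm] *= np.exp(eta * est)
    weights /= weights.max()            # renormalize to avoid overflow

    # Queue dynamics (Lindley recursion): the deficit grows by lambda_i
    # each round and shrinks when arm i actually earns reward.
    served = np.zeros(N)
    served[arm] = r
    Q = np.maximum(Q + lam - served, 0.0)

print("empirical reward rates:", earned / T)
print("targets:               ", lam)
print("final queue backlogs:  ", Q)
```

In this sketch, raising V places more weight on reward maximization relative to meeting the target rates, while the max(·, 0) queue update is the standard Lindley recursion from queueing theory; a persistently large backlog Q_i signals that arm i's target rate is being violated.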
