Online Learning with Diverse User Preferences

by Chao Gan, et al.

In this paper, we investigate the impact of diverse user preferences on learning under the stochastic multi-armed bandit (MAB) framework. We aim to show that when the user preferences are sufficiently diverse and each arm can be optimal for certain users, the O(log T) regret incurred by exploring the sub-optimal arms under the standard stochastic MAB setting can be reduced to a constant. Our intuition is that to achieve sub-linear regret, the number of times an optimal arm is pulled should scale linearly in time; when all arms are optimal for certain users and pulled frequently, the estimated arm statistics can quickly converge to their true values, thus dramatically reducing the need for exploration. We cast the problem into a stochastic linear bandits model, where both the user preferences and the states of the arms are modeled as independent and identically distributed (i.i.d.) d-dimensional random vectors. After receiving the user preference vector at the beginning of each time slot, the learner pulls an arm and receives a reward equal to the inner product of the preference vector and the arm state vector. We also assume that the state of the pulled arm is revealed to the learner once it is pulled. We propose a Weighted Upper Confidence Bound (W-UCB) algorithm and show that it can achieve a constant regret when the user preferences are sufficiently diverse. The performance of W-UCB under general setups is also completely characterized and validated with synthetic data.
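To make the interaction model concrete, the sketch below simulates the setting the abstract describes: at each slot the learner observes an i.i.d. preference vector, pulls an arm, earns the inner product of the preference and arm-state vectors, and then observes the pulled arm's state. The W-UCB weighting itself is not specified in the abstract, so a plain UCB-style index on the estimated arm-state means is used as a stand-in; the dimension, arm count, horizon, and noise model are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 3, 4, 2000  # dimension, number of arms, horizon (illustrative)

# Each arm's state is an i.i.d. d-dimensional vector; here we fix per-arm
# means and draw noisy states around them (a modeling assumption for the demo).
arm_means = rng.uniform(0.0, 1.0, size=(K, d))

est_sum = np.zeros((K, d))  # running sum of observed states per arm
pulls = np.zeros(K)         # pull counts per arm

regret = 0.0
for t in range(1, T + 1):
    theta = rng.uniform(0.0, 1.0, size=d)  # i.i.d. user preference vector
    # UCB-style index: estimated reward plus an exploration bonus
    # (unpulled arms get an infinite index so each arm is tried once).
    est_mean = est_sum / np.maximum(pulls, 1)[:, None]
    bonus = np.sqrt(2 * np.log(t) / np.maximum(pulls, 1))
    idx = est_mean @ theta + np.where(pulls == 0, np.inf, bonus)
    a = int(np.argmax(idx))
    state = arm_means[a] + rng.normal(0, 0.05, size=d)  # revealed arm state
    est_sum[a] += state
    pulls[a] += 1
    # Pseudo-regret against the best arm for this particular preference.
    regret += np.max(arm_means @ theta) - arm_means[a] @ theta

print(f"cumulative pseudo-regret after T={T}: {regret:.2f}")
```

Because the preference vector varies across slots, different arms are optimal for different users, which is exactly the diversity condition under which the paper argues the exploration cost collapses to a constant.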




