Contextual Bandits and Optimistically Universal Learning

12/31/2022
by Moise Blanchard, et al.

We consider the contextual bandit problem on general action and context spaces, where the learner's rewards depend on their selected actions and an observable context. This generalizes the standard multi-armed bandit to the case where side information is available, e.g., patients' records or customers' history, which allows for personalized treatment. We focus on consistency – vanishing regret compared to the optimal policy – and show that for large classes of non-i.i.d. contexts, consistency can be achieved regardless of the time-invariant reward mechanism, a property known as universal consistency. Precisely, we first give necessary and sufficient conditions on the context-generating process for universal consistency to be possible. Second, we show that there always exists an algorithm that guarantees universal consistency whenever this is achievable, called an optimistically universal learning rule. Interestingly, for finite action spaces, learnable processes for universal learning are exactly the same as in the full-feedback setting of supervised learning, previously studied in the literature. In other words, learning can be performed with partial feedback without any generalization cost. The algorithms balance a trade-off between generalization (similar to structural risk minimization) and personalization (tailoring actions to specific contexts). Lastly, we consider the case of added continuity assumptions on rewards and show that these lead to universal consistency for significantly larger classes of data-generating processes.
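To make the interaction protocol concrete, here is a minimal sketch of the contextual bandit loop the abstract describes: at each round the learner observes a context, selects an action, and receives bandit feedback (only the chosen action's reward). The epsilon-greedy learner, the i.i.d. context draws, and the `reward_fn` stand-in for the time-invariant reward mechanism are illustrative assumptions, not the paper's algorithm (which handles general non-i.i.d. context processes).

```python
import random

def run_contextual_bandit(contexts, actions, reward_fn, T, eps=0.1, seed=0):
    """Epsilon-greedy learner with per-context action-value estimates.

    `reward_fn(context, action)` is a hypothetical stand-in for the
    time-invariant reward mechanism; contexts are drawn i.i.d. here
    purely for simplicity of the sketch.
    """
    rng = random.Random(seed)
    counts = {(c, a): 0 for c in contexts for a in actions}
    values = {(c, a): 0.0 for c in contexts for a in actions}
    total_reward = 0.0
    for _ in range(T):
        c = rng.choice(contexts)                 # observe a context
        if rng.random() < eps:
            a = rng.choice(actions)              # explore
        else:
            a = max(actions, key=lambda x: values[(c, x)])  # exploit
        r = reward_fn(c, a)                      # bandit feedback only
        counts[(c, a)] += 1
        values[(c, a)] += (r - values[(c, a)]) / counts[(c, a)]
        total_reward += r
    # Mean reward of the optimal policy (best action per context).
    best = sum(max(reward_fn(c, a) for a in actions)
               for c in contexts) / len(contexts)
    return total_reward / T, best
```

Consistency in the abstract's sense means the gap between the learner's average reward and the optimal policy's mean vanishes as `T` grows; with a deterministic `reward_fn`, the average reward of this sketch approaches the optimum up to the exploration rate `eps`.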

Related research

- Non-stationary Contextual Bandits and Universal Learning (02/14/2023): We study the fundamental limits of learning in contextual bandits, where...
- Universal Regression with Adversarial Responses (03/09/2022): We provide algorithms for regression with adversarial responses under la...
- Contextual bandits with concave rewards, and an application to fair ranking (10/18/2022): We consider Contextual Bandits with Concave Rewards (CBCR), a multi-obje...
- Contextual Multi-armed Bandits under Feature Uncertainty (03/03/2017): We study contextual multi-armed bandit problems under linear realizabili...
- Optimal Contextual Bandits with Knapsacks under Realizability via Regression Oracles (10/21/2022): We study the stochastic contextual bandit with knapsacks (CBwK) problem,...
- Rate-Constrained Remote Contextual Bandits (04/26/2022): We consider a rate-constrained contextual multi-armed bandit (RC-CMAB) p...
- Exploiting Relevance for Online Decision-Making in High-Dimensions (07/01/2019): Many sequential decision-making tasks require choosing at each decision ...
