Pure Exploration under Mediators' Feedback

by Riccardo Poiani, et al.

Stochastic multi-armed bandits are a sequential decision-making framework in which, at each interaction step, the learner selects an arm and observes a stochastic reward. In best-arm identification (BAI) problems, the agent's goal is to find the optimal arm, i.e., the one with the highest expected reward, as accurately and efficiently as possible. However, the sequential interaction protocol of classical BAI problems, where the agent has complete control over the arm pulled at each round, does not effectively model several decision-making problems of interest (e.g., off-policy learning, partially controllable environments, and human feedback). For this reason, in this work, we propose a novel strict generalization of the classical BAI problem that we refer to as best-arm identification under mediators' feedback (BAI-MF). More specifically, we consider the scenario in which the learner has access to a set of mediators, each of which selects arms on the agent's behalf according to a stochastic and possibly unknown policy. The queried mediator then communicates back to the agent the pulled arm together with the observed reward. In this setting, the agent's goal is to sequentially choose which mediator to query so as to identify the optimal arm with high probability while minimizing the identification time, i.e., the sample complexity. To this end, we first derive and analyze a statistical lower bound on the sample complexity specific to our general mediator feedback scenario. Then, we propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner. As our theory verifies, this algorithm matches the lower bound both almost surely and in expectation. Finally, we extend these results to cases where the mediators' policies are unknown to the learner, obtaining comparable results.
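
To make the interaction protocol concrete, the following is a minimal simulation sketch in Python. It is not the algorithm proposed in the paper: the round-robin choice of mediator, the unit-variance Gaussian rewards, the fixed budget `n_rounds`, and all names (`simulate_bai_mf`, `arm_means`, `mediator_policies`) are assumptions made purely for illustration. It only shows the feedback loop in which the learner queries a mediator, the mediator samples an arm from its policy, and the learner observes the pulled arm and its reward.

```python
import random


def simulate_bai_mf(arm_means, mediator_policies, n_rounds, seed=0):
    """Simulate the mediator-feedback protocol with a naive learner.

    arm_means: expected reward of each arm (hypothetical environment).
    mediator_policies: one probability vector over arms per mediator.
    Returns the empirically best arm and the per-arm pull counts.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms

    for t in range(n_rounds):
        # Learner's decision: which mediator to query.
        # Round-robin is an illustrative placeholder, not the paper's strategy.
        mediator = t % len(mediator_policies)

        # The mediator, not the learner, picks the arm from its own policy.
        policy = mediator_policies[mediator]
        arm = rng.choices(range(n_arms), weights=policy)[0]

        # Assumed reward model: Gaussian with unit variance.
        reward = rng.gauss(arm_means[arm], 1.0)

        # The learner observes both the pulled arm and the reward.
        counts[arm] += 1
        sums[arm] += reward

    # Recommend the arm with the highest empirical mean among pulled arms.
    empirical = [
        sums[a] / counts[a] if counts[a] > 0 else float("-inf")
        for a in range(n_arms)
    ]
    best_arm = max(range(n_arms), key=lambda a: empirical[a])
    return best_arm, counts


if __name__ == "__main__":
    means = [0.0, 0.2, 1.0]
    policies = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
    print(simulate_bai_mf(means, policies, n_rounds=5000))
```

In this sketch the learner influences the arm distribution only through its choice of mediator, which is the key difference from classical BAI; a fixed-confidence method would replace both the round-robin rule and the fixed budget with an adaptive sampling and stopping rule.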




Good Arm Identification via Bandit Feedback

In this paper, we consider and discuss a new stochastic multi-armed band...

Best Arm Identification in Bandits with Limited Precision Sampling

We study best arm identification in a variant of the multi-armed bandit ...

Best arm identification in multi-armed bandits with delayed feedback

We propose a generalization of the best arm identification problem in st...

Stochastic Online Learning with Probabilistic Graph Feedback

We consider a problem of stochastic online learning with general probabi...

Interaction-Grounded Learning

Consider a prosthetic arm, learning to adapt to its user's control signa...

Optimal Odd Arm Identification with Fixed Confidence

The problem of detecting an odd arm from a set of K arms of a multi-arme...

PAC Best Arm Identification Under a Deadline

We study (ϵ, δ)-PAC best arm identification, where a decision-maker must...
