A Unified Framework for Conservative Exploration

by Yunchang Yang, et al.

We study bandits and reinforcement learning (RL) subject to a conservative constraint, where the agent is required to perform at least as well as a given baseline policy. This setting is particularly relevant in real-world domains such as digital marketing, healthcare, production, and finance. For multi-armed bandits, linear bandits, and tabular RL, specialized algorithms and theoretical analyses have been proposed in prior work. In this paper, we present a unified framework for conservative bandits and RL, whose core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, our framework gives a black-box reduction that turns a certain lower bound in the nonconservative setting into a new lower bound in the conservative setting. We strengthen the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL, and low-rank MDPs. For upper bounds, our framework turns a certain nonconservative upper-confidence-bound (UCB) algorithm into a conservative algorithm with a simple analysis. For multi-armed bandits, linear bandits, and tabular RL, our new upper bounds tighten or match existing ones with significantly simpler analyses. We also obtain a new upper bound for conservative low-rank MDPs.
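To make the budget idea concrete, here is a minimal sketch of a conservative UCB loop for Bernoulli multi-armed bandits, assuming a known baseline mean and rewards in [0, 1] (function names and details are illustrative, not the paper's implementation): before playing the optimistic arm, a budget check verifies that, even under pessimistic lower-confidence estimates, cumulative reward stays at least (1 − α) times the baseline's cumulative mean reward; otherwise the baseline arm is played.

```python
import numpy as np

def conservative_ucb(means, baseline_arm, alpha, horizon, seed=0):
    """Illustrative conservative UCB sketch (not the paper's exact algorithm).

    Baseline pulls are credited at the known baseline mean mu0; exploratory
    pulls are credited only at their lower confidence bound, so the budget
    check certifies the (1 - alpha) constraint in the worst case.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    mu0 = means[baseline_arm]  # assumption: baseline mean reward is known
    total = 0.0
    for t in range(1, horizon + 1):
        ucb = np.full(k, np.inf)   # optimistic estimates (infinite if unplayed)
        lcb = np.zeros(k)          # pessimistic estimates (zero if unplayed)
        played = counts > 0
        mean_hat = sums[played] / counts[played]
        bonus = np.sqrt(2.0 * np.log(t + 1) / counts[played])
        ucb[played] = mean_hat + bonus
        lcb[played] = np.maximum(mean_hat - bonus, 0.0)
        arm = int(np.argmax(ucb))
        # Budget check: pessimistic cumulative reward after this pull must
        # still dominate (1 - alpha) times t rounds of the baseline mean.
        explore = np.arange(k) != baseline_arm
        budget = (counts[baseline_arm] * mu0
                  + np.dot(counts[explore], lcb[explore])
                  + lcb[arm]
                  - (1.0 - alpha) * t * mu0)
        if budget < 0:
            arm = baseline_arm  # cannot certify the constraint; stay safe
        r = float(rng.random() < means[arm])  # Bernoulli reward draw
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts
```

Note the design choice the budget makes explicit: each baseline pull adds a deterministic surplus of α·mu0 to the budget, so exploration is gated early on and then opens up as the surplus accumulates, which is the mechanism the unified framework quantifies as the "necessary and sufficient budget."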
