Upper Counterfactual Confidence Bounds: a New Optimism Principle for Contextual Bandits

07/15/2020
by Yunbei Xu, et al.

The principle of optimism in the face of uncertainty is one of the most widely used and successful ideas in multi-armed bandits and reinforcement learning. However, existing optimistic algorithms (primarily UCB and its variants) are often unable to deal with large context spaces. Essentially all existing well-performing algorithms for general contextual bandit problems rely on weighted action-allocation schemes, and theoretical guarantees for optimism-based algorithms are only known for restricted formulations. In this paper we study general contextual bandits under the realizability condition and propose a simple generic principle for designing optimistic algorithms, dubbed "Upper Counterfactual Confidence Bounds" (UCCB). We show that these algorithms are provably optimal and efficient in the presence of large context spaces. Key components of UCCB include: 1) a systematic analysis of confidence bounds in policy space rather than in action space; and 2) a potential-function perspective that expresses the power of optimism in the contextual setting. We further show how the UCCB principle extends to infinite action spaces, by constructing confidence bounds via the newly introduced notion of "counterfactual action divergence."
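To make the optimism-in-policy-space idea concrete, below is a minimal toy sketch, not the paper's UCCB algorithm: a small finite policy class where each policy's value is estimated counterfactually, from rounds whose logged action matches what that policy would have played, and the learner always executes the policy with the highest upper confidence bound. The bonus form, the matching-based counterfactual update, and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: K actions, d-dimensional contexts, noisy linear rewards.
K, d, T = 5, 3, 2000
theta = rng.normal(size=(K, d))                 # unknown reward parameters

# A small finite policy class: each policy is a fixed linear scorer.
policies = [rng.normal(size=(K, d)) for _ in range(20)]

def act(policy, x):
    """Action the given policy picks on context x."""
    return int(np.argmax(policy @ x))

# Statistics for optimism in *policy* space: per-policy value estimates
# built only from rounds whose logged action matches the policy's choice.
value_sum = np.zeros(len(policies))
match_cnt = np.ones(len(policies))              # start at 1 to avoid div-by-zero

for t in range(1, T + 1):
    x = rng.normal(size=d)
    # Upper confidence bound per policy; the bonus shrinks as a policy
    # accumulates matched (counterfactually usable) observations.
    ucb = value_sum / match_cnt + np.sqrt(2.0 * np.log(t) / match_cnt)
    chosen = policies[int(np.argmax(ucb))]      # optimistic policy choice
    a = act(chosen, x)
    r = theta[a] @ x + rng.normal(scale=0.1)    # noisy realized reward
    # Counterfactual update: credit every policy that would have played a.
    for i, p in enumerate(policies):
        if act(p, x) == a:
            value_sum[i] += r
            match_cnt[i] += 1
```

The point of the sketch is the bookkeeping: confidence intervals are indexed by policies rather than by (context, action) pairs, so a single observed reward tightens the bound of every policy that agrees with the logged action on that context.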


Related research

07/19/2021 · An Analysis of Reinforcement Learning for Malaria Control
Previous work on policy learning for Malaria control has often formulate...

06/29/2021 · Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit
Balancing exploration and exploitation (EE) is a fundamental problem in ...

07/12/2022 · Contextual Bandits with Smooth Regret: Efficient Learning in Continuous Action Spaces
Designing efficient general-purpose contextual bandit algorithms that wo...

07/10/2019 · Productization Challenges of Contextual Multi-Armed Bandits
Contextual Multi-Armed Bandits is a well-known and accepted online optim...

07/12/2022 · Contextual Bandits with Large Action Spaces: Made Practical
A central problem in sequential decision making is to develop algorithms...

11/13/2020 · Improving Offline Contextual Bandits with Distributional Robustness
This paper extends the Distributionally Robust Optimization (DRO) approa...

11/20/2019 · Corruption Robust Exploration in Episodic Reinforcement Learning
We initiate the study of multi-stage episodic reinforcement learning und...
