Differentiable Meta-Learning in Contextual Bandits

by Branislav Kveton, et al.

We study a contextual bandit setting where the learning agent has access to sampled bandit instances from an unknown prior distribution P. The goal of the agent is to achieve high reward, on average, over instances drawn from P. This setting is of particular importance because it formalizes the offline optimization of bandit policies to perform well on average over anticipated bandit instances. The main idea in our work is to optimize differentiable bandit policies by policy gradients. We derive reward gradients that reflect the structure of our problem, and propose contextual policies that are parameterized in a differentiable way and have low regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction and learned biases, and the practicality of our approach on a range of classification tasks.
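To make the core idea concrete, the following is a minimal sketch (not the paper's algorithm) of optimizing a differentiable bandit policy by policy gradients with baseline subtraction. It simplifies to a non-contextual, single-pull setting: each bandit instance is a vector of Bernoulli arm means drawn from a hypothetical prior P, the policy is a softmax over arms with parameters theta, and a REINFORCE-style gradient with a running baseline updates theta to maximize average reward over instances. The prior, learning rates, and horizon are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # number of arms

def softmax(theta):
    # Numerically stable softmax over policy parameters
    z = np.exp(theta - theta.max())
    return z / z.sum()

def sample_instance():
    # Hypothetical prior P: arm 0 tends to pay more than arms 1 and 2
    a_params = np.array([4.0, 2.0, 2.0])
    b_params = np.array([2.0, 4.0, 4.0])
    return rng.beta(a_params, b_params)  # per-arm Bernoulli means

theta = np.zeros(K)   # differentiable policy parameters
baseline = 0.0        # running baseline for variance reduction
lr = 0.5

for _ in range(2000):
    mu = sample_instance()         # bandit instance drawn from P
    p = softmax(theta)
    a = rng.choice(K, p=p)         # act according to the current policy
    r = rng.binomial(1, mu[a])     # Bernoulli reward from the instance
    # REINFORCE: grad of log pi(a) for a softmax policy is e_a - p,
    # scaled by the baseline-subtracted reward
    grad_log = -p
    grad_log[a] += 1.0
    theta += lr * (r - baseline) * grad_log
    baseline += 0.05 * (r - baseline)  # track the average reward
```

Since arm 0 has the highest mean under this prior, gradient ascent concentrates the policy's probability mass on it; subtracting the running baseline reduces the variance of the gradient estimate without biasing it.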


Differentiable Bandit Exploration

We learn bandit policies that maximize the average reward over bandit in...

Latent Bandits Revisited

A latent bandit problem is one in which the learning agent knows the arm...

Learning Across Bandits in High Dimension via Robust Statistics

Decision-makers often face the "many bandits" problem, where one must si...

Policy Gradients for Contextual Bandits

We study a generalized contextual-bandits problem, where there is a stat...

DORB: Dynamically Optimizing Multiple Rewards with Bandits

Policy gradients-based reinforcement learning has proven to be a promisi...

Multi-Task Off-Policy Learning from Bandit Feedback

Many practical applications, such as recommender systems and learning to...
