Posterior Sampling for Large Scale Reinforcement Learning

by Georgios Theocharous, et al.

Posterior sampling for reinforcement learning (PSRL) is a popular algorithm for learning to control an unknown Markov decision process (MDP). PSRL maintains a distribution over MDP parameters and, in an episodic fashion, samples MDP parameters, computes the optimal policy for them, and executes it. A special case of PSRL is the setting in which the MDP resets to the initial state distribution at the end of each episode. Extensions of this idea to general MDPs without state resetting have so far produced impractical algorithms and, in some cases, flawed theoretical analyses. This is due to the difficulty of analyzing regret under episode switching schedules that depend on random variables of the true underlying model. We propose a solution to this problem that uses a deterministic, model-independent episode switching schedule, and we establish a Bayes regret bound under mild assumptions. Our algorithm, termed deterministic schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity. Our result applies more generally to continuous state-action problems. We demonstrate that this algorithm is well suited to sequential recommendation problems such as point-of-interest (POI) recommendation. We derive a general procedure for parameterizing the underlying MDPs to create action-conditional dynamics from passive data that do not contain actions. We prove that this parameterization satisfies the assumptions of our analysis.
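To make the sample, plan, execute loop concrete, the following is a minimal tabular sketch, not the authors' implementation. Several pieces are assumptions made only for illustration: a Dirichlet posterior over transitions, known rewards, discounted value iteration as the planner, and a doubling episode length as one possible deterministic, model-independent schedule. All function names (`sample_transitions`, `greedy_policy`, `episode_starts`, `ds_psrl_sketch`) are hypothetical.

```python
import random


def sample_transitions(counts, rng):
    """Draw one transition model from independent Dirichlet posteriors.

    counts[s][a][s2] holds a prior pseudo-count plus observed transitions.
    """
    model = []
    for s_counts in counts:
        rows = []
        for a_counts in s_counts:
            draws = [rng.gammavariate(c, 1.0) for c in a_counts]
            total = sum(draws)
            rows.append([d / total for d in draws])
        model.append(rows)
    return model


def greedy_policy(P, R, gamma=0.95, iters=200):
    """Plan in a given model with discounted value iteration (an assumed planner)."""
    n_s, n_a = len(P), len(P[0])

    def q(s, a, V):
        return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))

    V = [0.0] * n_s
    for _ in range(iters):
        V = [max(q(s, a, V) for a in range(n_a)) for s in range(n_s)]
    return [max(range(n_a), key=lambda a: q(s, a, V)) for s in range(n_s)]


def episode_starts(horizon):
    """Assumed deterministic schedule: episode lengths 1, 2, 4, 8, ..."""
    starts, t, length = [], 0, 1
    while t < horizon:
        starts.append(t)
        t += length
        length *= 2
    return starts


def ds_psrl_sketch(true_P, R, horizon, seed=0):
    """Run the loop on a small MDP without ever resetting the state."""
    rng = random.Random(seed)
    n_s, n_a = len(true_P), len(true_P[0])
    # Uniform Dirichlet prior: one pseudo-count per (s, a, s2).
    counts = [[[1.0] * n_s for _ in range(n_a)] for _ in range(n_s)]
    switch = set(episode_starts(horizon))  # fixed in advance, model-independent
    state, total_reward, policy = 0, 0.0, None
    for t in range(horizon):
        if t in switch:
            # New episode: resample a model and replan; the state carries over.
            policy = greedy_policy(sample_transitions(counts, rng), R)
        action = policy[state]
        nxt = rng.choices(range(n_s), weights=true_P[state][action])[0]
        counts[state][action][nxt] += 1.0  # posterior update
        total_reward += R[state][action]
        state = nxt
    return total_reward, counts
```

Because the switch times are computed before any data is seen, they do not depend on the true model, which is the property the abstract credits for making the regret analysis tractable. A data-dependent rule (for example, switching when a visit count doubles) would not have this property.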




Near-optimal Reinforcement Learning in Factored MDPs

Any reinforcement learning algorithm that applies to all Markov decision...

Efficient Policy Learning for Non-Stationary MDPs under Adversarial Manipulation

A Markov Decision Process (MDP) is a popular model for reinforcement lea...

Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

The problem of reinforcement learning in an unknown and discrete Markov ...

MDPs with Unawareness in Robotics

We formalize decision-making problems in robotics and automated control ...

Sequential Knockoffs for Variable Selection in Reinforcement Learning

In real-world applications of reinforcement learning, it is often challe...

Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management

We consider a stochastic inventory control problem under censored demand...

Efficient Reinforcement Learning via Initial Pure Exploration

In several realistic situations, an interactive learning agent can pract...
