Posterior Sampling for Large Scale Reinforcement Learning
Posterior sampling for reinforcement learning (PSRL) is a popular algorithm for learning to control an unknown Markov decision process (MDP). PSRL maintains a distribution over MDP parameters and proceeds in episodes: at the start of each episode it samples MDP parameters, computes the optimal policy for the sampled MDP, and executes that policy. In a special case of PSRL, the MDP resets to the initial state distribution at the end of each episode. Extensions of this idea to general MDPs without state resetting have so far produced impractical algorithms and, in some cases, flawed theoretical analyses. This is due to the difficulty of analyzing regret under episode-switching schedules that depend on random variables of the true underlying model. We propose a solution to this problem that uses a deterministic, model-independent episode-switching schedule, and we establish a Bayes regret bound under mild assumptions. Our algorithm, termed deterministic schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity. Our result also applies more generally to continuous state-action problems. We demonstrate that this algorithm is well suited to sequential recommendation problems such as point-of-interest (POI) recommendation. We derive a general procedure for parameterizing the underlying MDPs that creates action-conditional dynamics from passive data, which do not contain actions. We prove that this parameterization satisfies the assumptions of our analysis.
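To make the loop described above concrete, the following is a minimal sketch of posterior sampling with a deterministic switching schedule on a small tabular MDP. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the schedule that resamples whenever the time step is a power of two, the Dirichlet posterior over transitions, the known reward table, and the discounted value iteration used in place of an average-reward planner. The names is_switch_time, greedy_policy, and run_ds_psrl are hypothetical.

```python
import numpy as np


def is_switch_time(t):
    # Assumed schedule for illustration: resample when t is a power of two.
    # It depends only on t, never on the data or the true model.
    return t >= 1 and (t & (t - 1)) == 0


def greedy_policy(P, R, gamma=0.95, iters=200):
    # Discounted value iteration on a sampled MDP (a simplification).
    # P has shape (S, A, S); R has shape (S, A).
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


def run_ds_psrl(P_true, R, T, seed=0):
    # Runs T steps with no state resets and returns the switch times
    # together with the final Dirichlet counts.
    rng = np.random.default_rng(seed)
    S, A, _ = P_true.shape
    counts = np.ones((S, A, S))  # Dirichlet(1, ..., 1) prior (assumption)
    state, policy, switches = 0, None, []
    for t in range(1, T + 1):
        if is_switch_time(t):
            # Sample one transition model from the current posterior.
            P_sample = np.empty((S, A, S))
            for s in range(S):
                for a in range(A):
                    P_sample[s, a] = rng.dirichlet(counts[s, a])
            policy = greedy_policy(P_sample, R)
            switches.append(t)
        action = policy[state]
        next_state = rng.choice(S, p=P_true[state, action])
        counts[state, action, next_state] += 1  # posterior update
        state = next_state
    return switches, counts
```

Calling run_ds_psrl with a transition array of shape (S, A, S), a reward array of shape (S, A), and a horizon T returns the time steps at which a new model was sampled. Because the schedule is a function of t alone, those switch times are fixed in advance and are independent of the true model, which is the property the abstract identifies as simplifying the regret analysis.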