Efficient Reinforcement Learning via Initial Pure Exploration

by   Sudeep Raja Putta, et al.

In several realistic situations, an interactive learning agent can practice and refine its strategy before going on to be evaluated. For instance, consider a student preparing for a series of tests. She would typically take a few practice tests to know which areas she needs to improve upon. Based of the scores she obtains in these practice tests, she would formulate a strategy for maximizing her scores in the actual tests. We treat this scenario in the context of an agent exploring a fixed-horizon episodic Markov Decision Process (MDP), where the agent can practice on the MDP for some number of episodes (not necessarily known in advance) before starting to incur regret for its actions. During practice, the agent's goal must be to maximize the probability of following an optimal policy. This is akin to the problem of Pure Exploration (PE). We extend the PE problem of Multi Armed Bandits (MAB) to MDPs and propose a Bayesian algorithm called Posterior Sampling for Pure Exploration (PSPE), which is similar to its bandit counterpart. We show that the Bayesian simple regret converges at an optimal exponential rate when using PSPE. When the agent starts being evaluated, its goal would be to minimize the cumulative regret incurred. This is akin to the problem of Reinforcement Learning (RL). The agent uses the Posterior Sampling for Reinforcement Learning algorithm (PSRL) initialized with the posteriors of the practice phase. We hypothesize that this PSPE + PSRL combination is an optimal strategy for minimizing regret in RL problems with an initial practice phase. We show empirical results which prove that having a lower simple regret at the end of the practice phase results in having lower cumulative regret during evaluation.


page 1

page 2

page 3

page 4


Reinforcement Learning for Finite-Horizon Restless Multi-Armed Multi-Action Bandits

We study a finite-horizon restless multi-armed bandit problem with multi...

No-Regret Exploration in Goal-Oriented Reinforcement Learning

Many popular reinforcement learning problems (e.g., navigation in a maze...

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

The standard assumption in reinforcement learning (RL) is that agents ob...

Provably Efficient Lifelong Reinforcement Learning with Linear Function Approximation

We study lifelong reinforcement learning (RL) in a regret minimization s...

Posterior Sampling for Large Scale Reinforcement Learning

Posterior sampling for reinforcement learning (PSRL) is a popular algori...

Adversarial Online Multi-Task Reinforcement Learning

We consider the adversarial online multi-task reinforcement learning set...

Boltzmann Exploration Done Right

Boltzmann exploration is a classic strategy for sequential decision-maki...

Please sign up or login with your details

Forgot password? Click here to reset