Near-Optimal Reward-Free Exploration for Linear Mixture MDPs with Plug-in Solver

10/07/2021
by Xiaoyu Chen, et al.

Although model-based reinforcement learning (RL) approaches are considered more sample efficient, existing algorithms usually rely on sophisticated planning algorithms that are tightly coupled with the model-learning procedure. Hence the learned models may not be easily re-used with more specialized planners. In this paper we address this issue and provide approaches to learn an RL model efficiently without the guidance of a reward signal. In particular, we take a plug-in solver approach, where we focus on learning a model in the exploration phase and demand that any planning algorithm run on the learned model gives a near-optimal policy. Specifically, we focus on the linear mixture MDP setting, where the probability transition matrix is an (unknown) convex combination of a set of existing models. We show that, by establishing a novel exploration algorithm, the plug-in approach learns a model using Õ(d^2H^3/ϵ^2) interactions with the environment, and any ϵ-optimal planner on the learned model gives an O(ϵ)-optimal policy in the original model. This sample complexity matches lower bounds for non-plug-in approaches and is thus statistically optimal. We achieve this result by leveraging a careful maximum total-variance bound that uses a Bernstein-type inequality together with properties specific to linear mixture MDPs.
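To make the setting concrete, the linear mixture structure described above can be written as follows (notation ours; the abstract only specifies the convex-combination structure). Given d known base models P_1, …, P_d and an unknown weight vector θ = (θ_1, …, θ_d), the transition kernel is

P_θ(s' | s, a) = θ_1 P_1(s' | s, a) + … + θ_d P_d(s' | s, a),   with θ_i ≥ 0 and θ_1 + … + θ_d = 1.

Under this structure, reward-free exploration amounts to estimating θ accurately enough that any ϵ-optimal planner applied to the estimated model returns an O(ϵ)-optimal policy for the true MDP, using only Õ(d^2H^3/ϵ^2) interactions.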
