Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs

by   Junkai Zhang, et al.

We study reward-free reinforcement learning (RL) with linear function approximation, where the agent works in two phases: (1) in the exploration phase, the agent interacts with the environment but cannot access the reward; and (2) in the planning phase, the agent is given a reward function and is expected to find a near-optimal policy based on samples collected in the exploration phase. The sample complexities of existing reward-free algorithms have a polynomial dependence on the planning horizon, which makes them intractable for long planning horizon RL problems. In this paper, we propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs), where the transition probability can be parameterized as a linear combination of known feature mappings. At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward and a high-order moment estimator for the aleatoric and epistemic uncertainties. When the total reward is bounded by 1, we show that our algorithm only needs to explore Õ( d^2ε^-2) episodes to find an ε-optimal policy, where d is the dimension of the feature mapping. The sample complexity of our algorithm only has a polylogarithmic dependence on the planning horizon and therefore is “horizon-free”. In addition, we provide an Ω(d^2ε^-2) sample complexity lower bound, which matches the sample complexity of our algorithm up to logarithmic factors, suggesting that our algorithm is optimal.


Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

We study the model-based reward-free reinforcement learning with linear ...

On Reward-Free Reinforcement Learning with Linear Function Approximation

Reward-free reinforcement learning (RL) is a framework which is suitable...

Improved Sample Complexity for Reward-free Reinforcement Learning under Low-rank MDPs

In reward-free reinforcement learning (RL), an agent explores the enviro...

Nearly Minimax Optimal Reward-free Reinforcement Learning

We study the reward-free reinforcement learning framework, which is part...

Near-Optimal Reward-Free Exploration for Linear Mixture MDPs with Plug-in Solver

Although model-based reinforcement learning (RL) approaches are consider...

Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL

While the primary goal of the exploration phase in reward-free reinforce...

Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs

Recent studies have shown that episodic reinforcement learning (RL) is n...

Please sign up or login with your details

Forgot password? Click here to reset