Active Coverage for PAC Reinforcement Learning

by Aymen Al Marjani, et al.

Collecting and leveraging data with good coverage properties plays a crucial role in several aspects of reinforcement learning (RL), including reward-free exploration and offline learning. However, the notion of "good coverage" depends on the application at hand, as data suitable for one context may not be suitable for another. In this paper, we formalize the problem of active coverage in episodic Markov decision processes (MDPs), where the goal is to interact with the environment so as to fulfill given sampling requirements. This framework is sufficiently flexible to specify any desired coverage property, making it applicable to any problem that involves online exploration. Our main contribution is an instance-dependent lower bound on the sample complexity of active coverage and a simple game-theoretic algorithm, CovGame, that nearly matches it. We then show that CovGame can be used as a building block to solve different PAC RL tasks. In particular, we obtain a simple algorithm for PAC reward-free exploration with an instance-dependent sample complexity that, in certain MDPs which are "easy to explore", is lower than the minimax one. By further coupling this exploration algorithm with a new technique for performing implicit eliminations in policy space, we obtain a computationally efficient algorithm for best-policy identification whose instance-dependent sample complexity scales with the gaps between policy values.
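To make the phrase "fulfill given sampling requirements" concrete, here is a minimal sketch of what such a requirement could look like on a toy tabular episodic MDP. Everything in it is an assumption for illustration: the encoding of requirements as target visit counts per (stage, state, action) triple, the names `targets` and `collect_until_covered`, the uniform toy dynamics, and the naive "largest deficit" action rule. It is not CovGame and makes no claim about the paper's algorithm or its sample complexity.

```python
import random
from collections import defaultdict


def collect_until_covered(targets, n_states, n_actions, horizon,
                          seed=0, max_episodes=10_000):
    """Play episodes until counts[(h, s, a)] >= targets[(h, s, a)] for all keys.

    `targets` is a hypothetical encoding of sampling requirements:
    a dict mapping (stage, state, action) to a required visit count.
    """
    rng = random.Random(seed)
    counts = defaultdict(int)

    def deficit(h, s, a):
        # Remaining number of visits still required for this triple.
        return max(targets.get((h, s, a), 0) - counts[(h, s, a)], 0)

    def covered():
        return all(counts[key] >= need for key, need in targets.items())

    episodes = 0
    while not covered() and episodes < max_episodes:
        s = 0  # assumed fixed initial state
        for h in range(horizon):
            # Naive rule (not CovGame): pick the action with the largest
            # unmet requirement at the current stage and state.
            a = max(range(n_actions), key=lambda b: deficit(h, s, b))
            counts[(h, s, a)] += 1
            # Toy dynamics: the next state is uniform, whatever (s, a) was.
            s = rng.randrange(n_states)
        episodes += 1
    return episodes, dict(counts)


H, S, A = 3, 2, 2
# Require 3 visits to every triple reachable from initial state 0.
targets = {(0, 0, a): 3 for a in range(A)}
targets.update({(h, s, a): 3
                for h in range(1, H) for s in range(S) for a in range(A)})

episodes, counts = collect_until_covered(targets, S, A, H)
satisfied = all(counts.get(key, 0) >= need for key, need in targets.items())
print(episodes, satisfied)
```

In this toy example the number of episodes played stands in for sample complexity. Since each episode visits exactly one triple per stage, at least 12 episodes are needed here regardless of the strategy, which gives a feel for why the cost of coverage depends on the requirements and on how easily the MDP can be steered.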




