Provably Efficient Maximum Entropy Exploration

by Elad Hazan, et al.
Princeton University
University of Washington

Suppose an agent is in a (possibly unknown) Markov decision process (MDP) in the absence of a reward signal. What might we hope the agent can efficiently learn to do? One natural, intrinsically defined objective is for the agent to learn a policy that induces a distribution over the state space that is as uniform as possible, as measured in an entropic sense. Despite the corresponding mathematical program being non-convex, our main result provides a provably efficient method (in terms of both sample size and computational complexity) to construct such a maximum-entropy exploratory policy. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which relies on access to an approximate MDP solver.
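The procedure the abstract outlines can be sketched as follows: maintain a mixture of policies, compute the state distribution the mixture induces, use the gradient of the entropy of that distribution as an intrinsic reward, and call an approximate MDP solver to obtain the next policy in the mixture. The sketch below is an illustrative reconstruction on a small tabular MDP, not the authors' implementation; the horizon `T`, the smoothing constant `eps`, the step-size schedule, and all function names are assumptions made for the example.

```python
import numpy as np

def state_distribution(P, policy, T=50):
    # Average state occupancy of a policy over T steps, starting uniform.
    n = P.shape[0]
    d = np.full(n, 1.0 / n)
    avg = np.zeros(n)
    for _ in range(T):
        avg += d
        # Transition matrix induced by the policy: M[s, t] = sum_a pi(a|s) P(t|s, a)
        M = np.einsum('sa,sat->st', policy, P)
        d = d @ M
    return avg / T

def solve_mdp(P, r, T=50):
    # Stand-in approximate MDP solver: finite-horizon value iteration
    # on a state-only reward r, returning the greedy deterministic policy.
    n, a, _ = P.shape
    V = np.zeros(n)
    for _ in range(T):
        Q = r[:, None] + np.einsum('sat,t->sa', P, V)
        V = Q.max(axis=1)
    greedy = np.zeros((n, a))
    greedy[np.arange(n), Q.argmax(axis=1)] = 1.0
    return greedy

def max_ent_frank_wolfe(P, iters=30, eps=1e-3):
    n, a, _ = P.shape
    policies = [np.full((n, a), 1.0 / a)]   # start from the uniform policy
    weights = [1.0]
    for t in range(1, iters + 1):
        # State distribution induced by the current policy mixture.
        d = sum(w * state_distribution(P, pi) for w, pi in zip(weights, policies))
        # Gradient of the entropy H(d) = -sum_s d(s) log d(s) points along -log d,
        # so -log d serves as the intrinsic reward for the linearized subproblem.
        r = -np.log(d + eps)
        pi_new = solve_mdp(P, r)
        eta = 2.0 / (t + 2)                 # standard Frank-Wolfe step size
        weights = [w * (1.0 - eta) for w in weights] + [eta]
        policies.append(pi_new)
    return policies, weights
```

The key design point mirrored from the abstract: the entropy objective is non-convex in the policy parameters but concave in the state distribution, so Frank-Wolfe over distributions reduces each iteration to a standard reward-maximization problem that any approximate MDP solver can handle.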




Active Model Estimation in Markov Decision Processes

We study the problem of efficient exploration in order to learn an accur...

Entropy Maximization for Markov Decision Processes Under Temporal Logic Constraints

We study the problem of synthesizing a policy that maximizes the entropy...

A Relation Analysis of Markov Decision Process Frameworks

We study the relation between different Markov Decision Process (MDP) fr...

Fast Rates for Maximum Entropy Exploration

We consider the reinforcement learning (RL) setting, in which the agent ...

An Incremental Off-policy Search in a Model-free Markov Decision Process Using a Single Sample Path

In this paper, we consider a modified version of the control problem in ...
