Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

12/29/2020
by Jean Tarbouriech, et al.

We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of ϵ-optimal goal-conditioned policies attaining all states that are incrementally reachable within L steps (in expectation) from a reference state s_0. In this paper, we introduce a novel model-based approach that interleaves discovering new states from s_0 and improving the accuracy of the model estimate used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as Õ(L^5 S_{L+ϵ} Γ_{L+ϵ} A ϵ^{-2}), where A is the number of actions, S_{L+ϵ} is the number of states that are incrementally reachable from s_0 in L+ϵ steps, and Γ_{L+ϵ} is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both ϵ and L at the cost of an extra Γ_{L+ϵ} factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an ϵ/c_min-optimal policy for any cost-sensitive shortest-path problem defined on the L-reachable states with minimum cost c_min. Finally, we report preliminary empirical results confirming our theoretical findings.
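
To make the abstract's description of DisCo more concrete, here is a minimal sketch of the interleaved discover-and-control loop it alludes to: estimate a transition model from data gathered around the set K of already-controllable states, run value iteration on that model to estimate the expected number of steps needed to reach each candidate goal, and add a goal to K once the estimate from s_0 falls below L + ϵ. This is not the paper's algorithm: it omits the confidence intervals, optimistic planning, and sample-allocation rules that drive the stated sample-complexity bound, and the environment interface (ToyChainEnv, env.step, n_samples_per_round, n_rounds) is entirely hypothetical.

```python
import numpy as np


class ToyChainEnv:
    """Tiny deterministic chain MDP used only to exercise the sketch below."""

    def __init__(self, n_states=8, n_actions=2):
        self.n_states, self.n_actions = n_states, n_actions

    def step(self, s, a):
        # Action 0 moves right, action 1 moves left, clipped to the state space.
        return min(s + 1, self.n_states - 1) if a == 0 else max(s - 1, 0)


def disco_sketch(env, s0, L, eps, n_samples_per_round=200, n_rounds=20):
    """Crude discover-and-control loop inspired by the DisCo description.

    Returns the set K of states deemed reachable within ~L steps from s0,
    and greedy goal-conditioned policies computed on the estimated model.
    """
    nS, nA = env.n_states, env.n_actions
    counts = np.zeros((nS, nA, nS))   # empirical transition counts
    K = {s0}                          # states currently marked as controllable
    policies = {}                     # goal -> greedy policy on the estimated model

    for _ in range(n_rounds):
        # (1) Collect transitions around the controllable set K. Uniform sampling
        #     over K is a crude stand-in for the paper's goal-directed collection,
        #     which reaches these states by executing goal-conditioned policies.
        for _ in range(n_samples_per_round):
            s = np.random.choice(list(K))
            a = np.random.randint(nA)
            counts[s, a, env.step(s, a)] += 1

        # Empirical transition model; unseen state-action pairs fall back to a
        # self-loop so unexplored regions do not look spuriously easy to reach
        # (the real algorithm handles this with confidence intervals instead).
        totals = counts.sum(axis=2, keepdims=True)
        P_hat = counts / np.maximum(totals, 1)
        unseen = totals[..., 0] == 0
        P_hat[unseen] = np.eye(nS)[np.nonzero(unseen)[0]]

        # (2) For each candidate goal, estimate the expected number of steps
        #     from every state via value iteration with unit costs.
        for goal in range(nS):
            if goal in K:
                continue
            V = np.zeros(nS)                 # V[s] ~ expected steps from s to goal
            for _ in range(200):
                Q = 1.0 + P_hat @ V          # one step plus expected cost-to-go
                Q[goal] = 0.0                # the goal itself costs nothing
                V = np.minimum(Q.min(axis=1), 2 * L)   # truncate to keep values bounded
            if V[s0] <= L + eps:             # goal looks (L + eps)-reachable from s0
                K.add(goal)
                policies[goal] = Q.argmin(axis=1)

    return K, policies


if __name__ == "__main__":
    env = ToyChainEnv()
    K, policies = disco_sketch(env, s0=0, L=6, eps=0.5)
    print(sorted(K))  # states the sketch deems incrementally reachable within ~L steps
```

On the toy chain, states beyond L + ϵ steps from s_0 should stay outside K, loosely mirroring the notion of incremental L-reachability; any quantitative behaviour of the actual DisCo algorithm is of course not captured by this sketch.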
