No-Regret Exploration in Goal-Oriented Reinforcement Learning

12/07/2019
by Jean Tarbouriech, et al.

Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the so-called episodic setting or stochastic shortest path (SSP) problem, where an agent has to reach a predefined goal state (e.g., the top of the hill) while maximizing the cumulative reward or minimizing the cumulative cost. Despite the popularity of this setting, most of the literature studying the exploration-exploitation dilemma has either focused on different problems (i.e., fixed-horizon and infinite-horizon) or made the restrictive loop-free assumption (which implies that no state can be visited twice during any episode). In this paper, we study the general SSP setting and introduce the algorithm UC-SSP, whose regret scales as $O(c_{\max}^{3/2} c_{\min}^{-1/2} D S \sqrt{A D K})$ after $K$ episodes for any unknown SSP with $S$ non-terminal states, $A$ actions, an SSP-diameter of $D$, and positive costs in $[c_{\min}, c_{\max}]$. UC-SSP is thus the first learning algorithm with vanishing regret in the theoretically challenging setting of episodic RL.
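To see why a bound of this form means the regret vanishes on average, the short sketch below evaluates only the leading term stated in the abstract and divides it by the number of episodes. It is an illustration, not part of UC-SSP: the function names are hypothetical, and the constants and any logarithmic factors hidden by the O-notation are assumed to be 1.

```python
import math


def regret_bound_scale(c_min, c_max, D, S, A, K):
    """Leading term c_max^{3/2} c_min^{-1/2} D S sqrt(A D K).

    Assumption: hidden constants and log factors are dropped, so only
    the growth rate in K is meaningful, not the absolute value.
    """
    if not (0 < c_min <= c_max):
        raise ValueError("costs must satisfy 0 < c_min <= c_max")
    return c_max ** 1.5 * c_min ** -0.5 * D * S * math.sqrt(A * D * K)


def per_episode_regret_scale(c_min, c_max, D, S, A, K):
    """Leading term divided by K; it shrinks like 1 / sqrt(K)."""
    return regret_bound_scale(c_min, c_max, D, S, A, K) / K


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    params = dict(c_min=0.1, c_max=1.0, D=5, S=10, A=4)
    for K in (100, 10_000, 1_000_000):
        print(K, per_episode_regret_scale(K=K, **params))
```

Because the total grows like the square root of $K$, multiplying the number of episodes by four only doubles the leading term, so the per-episode value halves.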


