Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space

by Jonatha Anselmi, et al.

In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth and death structure. Specifically, we consider a controlled queue with impatient jobs, and the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the diameter D of the MDP is Ω(S^S), where S is the number of states. Therefore, the existing lower and upper bounds on the regret at time T, of order O(√(DSAT)) for MDPs with S states and A actions, may suggest that reinforcement learning is inefficient here. In our main result, however, we exploit the structure of our MDPs to show that the regret of a slightly tweaked version of the classical learning algorithm Ucrl2 is in fact upper bounded by 𝒪̃(√(E_2 A T)), where E_2 is related to the weighted second moment of the stationary measure of a reference policy. Importantly, E_2 is bounded independently of S. Thus, our bound is asymptotically independent of the number of states and of the diameter. This result is based on a careful study of the number of visits performed by the learning algorithm to the states of the MDP, which is highly non-uniform.
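The birth-and-death structure described above can be sketched as a uniformized controlled queue with impatient jobs: arrivals move the state up by one, while service and abandonment move it down by one, and the per-step cost trades off energy against congestion. The rates and cost weights below are illustrative assumptions for the sketch, not values taken from the paper.

```python
import random

def step(state, action, S, lam=0.4, mu=0.5, theta=0.1):
    """One uniformized transition of a birth-death queue with impatient jobs.

    state: number of jobs, in 0..S; action: server activation in {0, 1}.
    lam (arrival rate), mu (service rate), theta (per-job abandonment rate)
    are illustrative parameters, not taken from the paper.
    """
    birth = lam if state < S else 0.0
    death = (mu * action + theta * state) if state > 0 else 0.0
    # Uniformization constant: an upper bound on the total event rate.
    unif = lam + mu + theta * S
    u = random.random() * unif
    if u < birth:
        return state + 1      # arrival (birth)
    if u < birth + death:
        return state - 1      # service completion or abandonment (death)
    return state              # self-loop added by uniformization

def cost(state, action, energy=1.0, hold=0.2):
    """Illustrative trade-off: energy consumption vs. user-perceived performance."""
    return energy * action + hold * state
```

Because every transition moves the state by at most one, a learning algorithm must traverse intermediate states to reach rarely visited ones, which is the structural fact behind the highly non-uniform visit counts studied in the paper.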

