Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

01/31/2022
by Liyu Chen, et al.

We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We first design a policy optimization algorithm with a carefully designed action-value estimator and bonus term, and show that for ergodic MDPs it ensures O(√T) regret and constant constraint violation, where T is the total number of time steps. This strictly improves over the algorithm of Singh et al. (2020), whose regret and constraint violation are both O(T^{2/3}). Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with O(T^{2/3}) regret and constraint violation. Both bounds can be further improved to O(√T) via a simple modification, at the price of making the algorithm computationally inefficient. As far as we know, these are the first provable algorithms for weakly communicating MDPs with cost constraints.
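
The two performance measures named in the abstract can be written out as follows. This is a sketch of one common formalization, not necessarily the paper's exact definition. The abstract does not fix notation, so every symbol here is an assumption: r and c denote the reward and cost functions, τ the cost threshold, J^⋆ the optimal average reward among policies that satisfy the constraint, and (s_t, a_t) the state-action pair visited at step t.

```latex
% Assumed notation (not given in the abstract):
%   r, c       reward and cost functions
%   \tau       cost threshold
%   J^{\star}  optimal average reward among constraint-satisfying policies
R_T = \sum_{t=1}^{T} \bigl( J^{\star} - r(s_t, a_t) \bigr),
\qquad
C_T = \sum_{t=1}^{T} \bigl( c(s_t, a_t) - \tau \bigr)
```

Under this reading, the ergodic result corresponds to R_T = O(√T) with C_T = O(1), and the weakly communicating result bounds both R_T and C_T by O(T^{2/3}), or by O(√T) for the computationally inefficient variant.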

