Dynamic Regret of Online Markov Decision Processes

08/26/2022
by   Peng Zhao, et al.
1

We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We choose dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. The measure is strictly stronger than the standard static regret that benchmarks the learner's performance with a fixed compared policy. We consider three foundational models of online MDPs, including episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For these three models, we propose novel online ensemble algorithms and establish their dynamic regret guarantees respectively, in which the results for episodic (loop-free) SSP are provably minimax optimal in terms of time horizon and certain non-stationarity measure. Furthermore, when the online environments encountered by the learner are predictable, we design improved algorithms and achieve better dynamic regret bounds for the episodic (loop-free) SSP; and moreover, we demonstrate impossibility results for the infinite-horizon MDPs.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/19/2019

Online Convex Optimization in Adversarial Markov Decision Processes

We consider online learning in episodic loop-free Markov decision proces...
research
01/31/2022

Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

We study regret minimization for infinite-horizon average-reward Markov ...
research
12/08/2020

Minimax Regret Optimisation for Robust Planning in Uncertain Markov Decision Processes

The parameters for a Markov Decision Process (MDP) often cannot be speci...
research
06/15/2012

Simple Regret Optimization in Online Planning for Markov Decision Processes

We consider online planning in Markov decision processes (MDPs). In onli...
research
02/21/2017

Fast rates for online learning in Linearly Solvable Markov Decision Processes

We study the problem of online learning in a class of Markov decision pr...
research
01/31/2021

Online Markov Decision Processes with Aggregate Bandit Feedback

We study a novel variant of online finite-horizon Markov Decision Proces...
research
05/26/2022

Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback

We consider regret minimization for Adversarial Markov Decision Processe...

Please sign up or login with your details

Forgot password? Click here to reset