Learning Adversarial Markov Decision Processes with Delayed Feedback

12/29/2020
by Tal Lancewicki, et al.

Reinforcement learning typically assumes that the agent observes feedback from the environment immediately, but in many real-world applications (such as recommendation systems) the feedback is observed with delay. We therefore consider online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed feedback: the costs and trajectory of episode k are only available at the end of episode k + d^k, where the delays d^k are neither identical nor bounded and are chosen by an adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of O(√K + √D) under full-information feedback, where K is the number of episodes and D = ∑_k d^k is the total delay. Under bandit feedback, we prove a similar O(√K + √D) regret bound assuming the costs are stochastic, and an O(K^{2/3} + D^{2/3}) bound in the general case. To our knowledge, we are the first to consider the important setting of delayed feedback in adversarial MDPs.
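To make the interaction protocol concrete, below is a minimal Python sketch of an episodic loop with adversarial delays. The `learner` and `env` objects and their methods are hypothetical placeholders for illustration, not the paper's actual interfaces or algorithm.

```python
from collections import defaultdict

def run_with_delayed_feedback(learner, env, delays):
    """Sketch of the delayed-feedback protocol: the costs and
    trajectory of episode k are revealed only at the end of
    episode k + d^k, with delays d^k chosen by an adversary."""
    K = len(delays)
    pending = defaultdict(list)  # arrival episode -> delayed feedback
    for k in range(K):
        policy = learner.current_policy()             # hypothetical interface
        trajectory, costs = env.play_episode(policy)  # hypothetical interface
        # This episode's feedback only arrives at the end of episode k + d^k.
        pending[k + delays[k]].append((k, trajectory, costs))
        # Update only with feedback whose delay has now elapsed.
        for episode, traj, c in pending.pop(k, []):
            learner.update(episode, traj, c)
```

With d^k ≡ 0 this reduces to the standard episodic loop; the total delay D = ∑_k d^k appearing in the regret bounds measures how much feedback is outstanding in aggregate.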


Related research

01/31/2022 · Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
The standard assumption in reinforcement learning (RL) is that agents ob...

05/07/2020 · Reinforcement Learning with Feedback Graphs
We study episodic reinforcement learning in Markov decision processes wh...

01/23/2019 · Learning to Collaborate in Markov Decision Processes
We consider a two-agent MDP framework where agents repeatedly solve a ta...

12/03/2019 · Learning Adversarial MDPs with Bandit Feedback and Unknown Transition
We consider the problem of learning in episodic finite-horizon Markov de...

02/21/2017 · Fast rates for online learning in Linearly Solvable Markov Decision Processes
We study the problem of online learning in a class of Markov decision pr...

05/26/2022 · Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback
We consider regret minimization for Adversarial Markov Decision Processe...

05/15/2023 · A Unified Analysis of Nonstochastic Delayed Feedback for Combinatorial Semi-Bandits, Linear Bandits, and MDPs
We derive a new analysis of Follow The Regularized Leader (FTRL) for onl...
