Acting in Delayed Environments with Non-Stationary Markov Policies

by Esther Derman et al.

The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it is chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of m steps. The brute-force state-augmentation baseline, in which the state is concatenated with the last m committed actions, suffers from complexity exponential in m, as we show for policy iteration. We then prove that with execution delay, Markov policies in the original state space are sufficient for attaining maximal reward, but need to be non-stationary. Stationary Markov policies, in contrast, are sub-optimal in general. Consequently, we devise a non-stationary Q-learning-style model-based algorithm that solves delayed-execution tasks without resorting to state augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state augmentation struggle or fail due to divergence. The code is available at
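To make the delayed-execution setting concrete, the following is a minimal sketch of an environment wrapper that buffers committed actions for m steps before executing them. The `DelayedEnv` class, its Gym-style `reset`/`step` interface, and the use of a fixed `default_action` to fill the queue at the start of an episode are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

class DelayedEnv:
    """Wrap an environment so each committed action is executed m steps later.

    For the first m steps of an episode, a fixed default action is executed,
    one common convention for initializing the delay queue (an assumption
    here, not necessarily the paper's choice).
    """

    def __init__(self, env, m, default_action):
        self.env = env                      # underlying Gym-style environment
        self.m = m                          # execution delay in steps
        self.default_action = default_action
        self.queue = deque()                # actions committed but not yet executed

    def reset(self):
        # Pre-fill the queue so the first m executed actions are defaults.
        self.queue = deque([self.default_action] * self.m)
        return self.env.reset()

    def step(self, action):
        # Commit the new action; execute the one committed m steps ago.
        self.queue.append(action)
        executed = self.queue.popleft()
        return self.env.step(executed)
```

Note that the agent's effective state under this wrapper is the environment state plus the m pending actions in the queue, which is exactly why the naive augmented-state formulation grows exponentially in m.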


