Geometric Policy Iteration for Markov Decision Processes

06/12/2022
by Yue Wu, et al.

Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDPs) shed light on the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement, and we relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims for faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values, which is more flexible than traditional policy iteration and advantageous when the state set is large. We prove that the complexity of GPI achieves the best known bound O(|A|/(1−γ) log(1/(1−γ))) of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.
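The switch-then-evaluate loop described above can be pictured with a short sketch. The following is a minimal, hypothetical Python illustration, not the paper's implementation: the names gpi_sketch and evaluate, the array layout (P of shape (S, A, S), R of shape (S, A)), the tolerance, and the full linear solve in evaluate are all assumptions made for clarity. In particular, the greedy argmax here is a stand-in for the paper's boundary-seeking action selection, and the full policy re-evaluation after each single-state switch stands in for whatever cheaper value update the paper uses.

    import numpy as np

    def evaluate(P, R, gamma, pi):
        """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
        S = P.shape[0]
        P_pi = P[np.arange(S), pi]   # (S, S) transition matrix under pi
        R_pi = R[np.arange(S), pi]   # (S,) reward vector under pi
        return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

    def gpi_sketch(P, R, gamma, max_sweeps=1000, tol=1e-10):
        """Hypothetical sketch of a GPI-style solver for a tabular MDP.

        Unlike classical policy iteration, which improves all states and
        then re-evaluates, this loop switches the action of one state at
        a time and updates the value function immediately afterward.
        """
        S, A, _ = P.shape
        pi = np.zeros(S, dtype=int)       # start from an arbitrary policy
        V = evaluate(P, R, gamma, pi)
        for _ in range(max_sweeps):
            improved = False
            for s in range(S):            # states may be swept asynchronously
                # one-step lookahead Q-values at state s under the current V
                q = R[s] + gamma * P[s] @ V
                a = int(np.argmax(q))
                if q[a] > V[s] + tol:     # switch this single state's action...
                    pi[s] = a
                    V = evaluate(P, R, gamma, pi)  # ...and update values at once
                    improved = True
            if not improved:              # no state can improve: pi is optimal
                break
        return pi, V

The design point the sketch tries to capture is that later states in a sweep see values already improved by earlier switches, which is what distinguishes this update rule from the batch greedy improvement step of classical policy iteration.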


Related Research

06/22/2021
Variance-Aware Off-Policy Evaluation with Linear Function Approximation
We study the off-policy evaluation (OPE) problem in reinforcement learni...

07/23/2021
An Adaptive State Aggregation Algorithm for Markov Decision Processes
Value iteration is a well-known method of solving Markov Decision Proces...

06/03/2020
Kernel Taylor-Based Value Function Approximation for Continuous-State Markov Decision Processes
We propose a principled kernel-based policy iteration algorithm to solve...

07/04/2022
Doubly-Asynchronous Value Iteration: Making Value Iteration Asynchronous in Actions
Value iteration (VI) is a foundational dynamic programming method, impor...

02/08/2020
Provably Efficient Adaptive Approximate Policy Iteration
Model-free reinforcement learning algorithms combined with value functio...

07/15/2022
Set-based value operators for non-stationary Markovian environments
This paper analyzes finite state Markov Decision Processes (MDPs) with u...

11/29/2015
Exploiting Anonymity in Approximate Linear Programming: Scaling to Large Multiagent MDPs (Extended Version)
Many exact and approximate solution methods for Markov Decision Processe...
