Provably Efficient Adaptive Approximate Policy Iteration

02/08/2020
by Botao Hao, et al.

Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains, including games and robotics. However, the theoretical understanding of such algorithms is limited, and existing results are largely focused on episodic or discounted Markov decision processes (MDPs). In this work, we present adaptive approximate policy iteration (AAPI), a learning scheme which enjoys an O(T^{2/3}) regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of O(T^{3/4}) for the average-reward case with function approximation. Our algorithm and analysis rely on adversarial online learning techniques, where value functions are treated as losses. The main technical novelty is the use of a data-dependent adaptive learning rate coupled with a so-called optimistic prediction of upcoming losses. In addition to theoretical guarantees, we demonstrate the advantages of our approach empirically on several environments.
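To make the last idea concrete, the following is a minimal sketch of a generic optimistic exponential-weights learner with a data-dependent learning rate, applied to a single state with a finite action set. It is not the AAPI algorithm from the paper: the class name `OptimisticAdaptiveHedge`, the `scale` parameter, the choice of the previous loss as the optimistic prediction, and the specific learning-rate formula are all assumptions made for illustration. The loss vector could stand in for, say, negated action-value estimates at one state.

```python
import math


def softmax_neg(scores, eta):
    """Return weights proportional to exp(-eta * score), numerically stabilized."""
    lowest = min(scores)
    weights = [math.exp(-eta * (s - lowest)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


class OptimisticAdaptiveHedge:
    """Hypothetical illustration: exponential weights with an optimistic
    loss prediction and a learning rate that adapts to prediction errors."""

    def __init__(self, n_actions, scale=1.0):
        self.cum_loss = [0.0] * n_actions    # sum of observed losses per action
        self.prediction = [0.0] * n_actions  # guess for the upcoming loss
        self.cum_sq_error = 0.0              # sum of squared prediction errors
        self.scale = scale                   # assumed tuning constant

    def learning_rate(self):
        # Data-dependent rate: stays large while predictions are accurate,
        # shrinks as prediction errors accumulate (an assumed, common form).
        return self.scale / math.sqrt(1.0 + self.cum_sq_error)

    def policy(self):
        # Optimism: act as if the predicted loss had already been incurred.
        scores = [c + m for c, m in zip(self.cum_loss, self.prediction)]
        return softmax_neg(scores, self.learning_rate())

    def update(self, loss):
        # Measure how wrong the prediction was (max-norm), then record the loss.
        err = max(abs(l - m) for l, m in zip(loss, self.prediction))
        self.cum_sq_error += err * err
        self.cum_loss = [c + l for c, l in zip(self.cum_loss, loss)]
        # Assumed predictor: the next loss will resemble the most recent one.
        self.prediction = list(loss)


# Usage: three actions with a fixed loss vector; action 1 has the lowest loss.
learner = OptimisticAdaptiveHedge(n_actions=3)
for _ in range(50):
    current_policy = learner.policy()
    learner.update([1.0, 0.2, 0.7])
final_policy = learner.policy()
```

In this toy run the losses never change, so only the first prediction is wrong and the learning rate stops shrinking after one step. This is the behavior an adaptive rate is meant to exploit when upcoming losses are predictable.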


research
06/28/2023

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

We develop several provably efficient model-free reinforcement learning ...
research
06/29/2020

Learning and Planning in Average-Reward Markov Decision Processes

We introduce improved learning and planning algorithms for average-rewar...
research
06/12/2022

Geometric Policy Iteration for Markov Decision Processes

Recently discovered polyhedral structures of the value function for fini...
research
02/25/2021

Improved Regret Bound and Experience Replay in Regularized Policy Iteration

In this work, we study algorithms for learning in infinite-horizon undis...
research
10/13/2022

Reinforcement Learning with Unbiased Policy Evaluation and Linear Function Approximation

We provide performance guarantees for a variant of simulation-based poli...
research
06/08/2020

Randomized Policy Learning for Continuous State and Action MDPs

Deep reinforcement learning methods have achieved state-of-the-art resul...
research
06/07/2022

Concentration bounds for SSP Q-learning for average cost MDPs

We derive a concentration bound for a Q-learning algorithm for average c...
