A maximum-entropy approach to off-policy evaluation in average-reward MDPs

06/17/2020
by Nevena Lazic et al.

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e., whose rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, where the feature dynamics are approximately linear and the rewards are arbitrary, we propose a new approach to estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under the empirical dynamics, and show that the solution is an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
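For intuition, here is a minimal sketch of the constrained maximum-entropy program the abstract describes, in generic notation that need not match the paper's: d is a candidate stationary state-action distribution, \phi the known feature map, \hat{P} the empirical transition model, and \pi the target (evaluation) policy.

\[
\max_{d \in \Delta(\mathcal{S} \times \mathcal{A})} \; -\sum_{s,a} d(s,a) \log d(s,a)
\quad \text{s.t.} \quad
\sum_{s,a} d(s,a)\, \phi(s,a) \;=\; \sum_{s,a} d(s,a)\, \mathbb{E}_{s' \sim \hat{P}(\cdot \mid s,a),\; a' \sim \pi(\cdot \mid s')}\!\left[ \phi(s',a') \right]
\]

Because the constraints are linear in d, Lagrangian duality gives an exponential-family maximizer with the features as sufficient statistics, of the form d_\theta(s,a) \propto \exp(\theta^{\top} \phi(s,a)) under a suitable parameterization. A natural OPE estimate of the average reward is then \sum_{s,a} d_\theta(s,a)\, \hat{r}(s,a), where \hat{r} is a reward model fit from the off-policy data (an illustrative choice, not necessarily the paper's estimator).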


