Off-policy evaluation for MDPs with unknown structure

02/11/2015
by Assaf Hallak, et al.

Off-policy learning in dynamic decision problems is essential for providing strong evidence that a new policy is better than the one in use. But how can we prove superiority without testing the new policy? To answer this question, we introduce the G-SCOPE algorithm that evaluates a new policy based on data generated by the existing policy. Our algorithm is both computationally efficient and sample efficient because it greedily learns to exploit factored structure in the dynamics of the environment. We present a finite-sample analysis of our approach and show through experiments that the algorithm scales well on high-dimensional problems with few samples.
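
The paper's G-SCOPE algorithm greedily learns which factored structure to exploit and comes with a finite-sample analysis; the snippet below is only a minimal sketch of the model-based idea it builds on, under the simplifying assumptions that the factored structure (each state variable's parent set) is already known and that state variables are binary. It fits per-factor transition counts from data logged under the behavior policy, then estimates the target policy's value by Monte Carlo rollouts in the learned model. All names here are illustrative, not from the paper.

```python
import numpy as np
from collections import defaultdict


def fit_factored_model(transitions, parents):
    """Estimate P(s'_i | parents_i(s), a) by counting logged (s, a, s') tuples.

    `parents[i]` lists the indices of the state variables that the i-th
    next-state variable is assumed to depend on (structure assumed known here;
    G-SCOPE instead learns it greedily from the data).
    """
    counts = [defaultdict(lambda: np.zeros(2)) for _ in parents]
    for s, a, s_next in transitions:
        for i, pa in enumerate(parents):
            key = (tuple(s[j] for j in pa), a)
            counts[i][key][s_next[i]] += 1.0

    def sample_next(s, a, rng):
        """Sample each next-state variable from its estimated conditional."""
        s_next = np.empty(len(parents), dtype=int)
        for i, pa in enumerate(parents):
            c = counts[i].get((tuple(s[j] for j in pa), a))
            if c is None or c.sum() == 0:
                p = np.array([0.5, 0.5])  # unseen parent context: fall back to uniform
            else:
                p = c / c.sum()
            s_next[i] = rng.choice(2, p=p)
        return s_next

    return sample_next


def evaluate_policy(policy, sample_next, reward_fn, start_states,
                    horizon=50, n_rollouts=200, gamma=0.95, seed=0):
    """Monte Carlo estimate of the target policy's discounted return
    in the learned (approximate) model."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_rollouts):
        s = np.array(start_states[rng.integers(len(start_states))])
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s, rng)
            total += discount * reward_fn(s, a)
            s = sample_next(s, a, rng)
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```

A call such as `evaluate_policy(target_policy, fit_factored_model(logged_data, parents), reward_fn, start_states)` would give the off-policy estimate; the paper's contribution lies largely in learning the structure (`parents`) greedily and bounding the resulting estimation error, which this sketch does not attempt.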

Related research:

- Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting (09/21/2018): In reinforcement learning (RL), one of the key components is policy eva...
- Minimax Weight Learning for Absorbing MDPs (01/09/2023): Reinforcement learning policy evaluation problems are often modeled as f...
- Nearly Optimal Latent State Decoding in Block MDPs (08/17/2022): We investigate the problems of model estimation and reward-free learning...
- Representation Balancing MDPs for Off-Policy Policy Evaluation (05/23/2018): We study the problem of off-policy policy evaluation (OPPE) in RL. In co...
- A maximum-entropy approach to off-policy evaluation in average-reward MDPs (06/17/2020): This work focuses on off-policy evaluation (OPE) with function approxima...
- Policy Certificates: Towards Accountable Reinforcement Learning (11/07/2018): The performance of a reinforcement learning algorithm can vary drastical...
- Offline Policy Evaluation with Out-of-Sample Guarantees (01/20/2023): We consider the problem of evaluating the performance of a decision poli...
