Sample-Optimal Parametric Q-Learning with Linear Transition Models

02/13/2019
by Lin F. Yang, et al.

Consider a Markov decision process (MDP) that admits a set of state-action features which can linearly express the process's probabilistic transition model. We propose a parametric Q-learning algorithm that finds an approximately optimal policy using a number of samples proportional to the feature dimension K and independent of the size of the state space. To further improve sample efficiency, we exploit the monotonicity property and intrinsic noise structure of the Bellman operator, provided there exist anchor state-action pairs that induce an implicit non-negativity constraint in the feature space. We augment the algorithm with variance reduction, monotonicity preservation, and confidence bounds. The resulting method is proved to find, with high probability, a policy that is ϵ-optimal from any initial state using Õ(K/(ϵ^2(1-γ)^3)) sample transitions, for arbitrarily large-scale MDPs with discount factor γ∈(0,1). A matching information-theoretic lower bound is proved, confirming the sample optimality of the proposed method with respect to all parameters (up to polylog factors).
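To make the setup concrete, here is a minimal sketch of the basic (un-augmented) parametric Q-learning loop the abstract builds on, assuming a linear transition model P(s'|s,a) = φ(s,a)ᵀψ(s') and a generative model that can sample next states on demand. All names (`phi`, `reward`, `sample_next_state`, `anchors`) are illustrative placeholders rather than the paper's API, and the variance-reduction, monotonicity-preservation, and confidence-bound refinements are omitted.

```python
import numpy as np

def parametric_q_learning(phi, reward, sample_next_state, anchors,
                          gamma=0.99, n_samples=100, n_iters=200, seed=0):
    """Sketch of the basic parametric Q-learning loop (hypothetical API).

    Assumes P(s'|s,a) = phi(s,a)^T psi(s') and a generative model that
    samples s' ~ P(.|s,a) on demand.

    phi               : (S, A, K) array of state-action features
    reward            : (S, A) array of rewards
    sample_next_state : callable (s, a, rng) -> index of a sampled next state
    anchors           : K state-action pairs whose features span R^K
    """
    rng = np.random.default_rng(seed)
    S, A, K = phi.shape
    # (K, K) anchor feature matrix, invertible by the anchor assumption.
    Phi_anchor = np.stack([phi[s, a] for s, a in anchors])
    # Q is parameterized as Q_w(s,a) = r(s,a) + gamma * phi(s,a)^T w.
    w = np.zeros(K)

    for _ in range(n_iters):
        # Greedy value implied by the current weight vector w.
        q = reward + gamma * (phi @ w)   # (S, A)
        v = q.max(axis=1)                # (S,)

        # Monte-Carlo estimate of E[v(s') | s, a] at the K anchors only:
        # the per-iteration sample cost is K * n_samples, independent of S.
        mean_v = np.empty(K)
        for i, (s, a) in enumerate(anchors):
            samples = [sample_next_state(s, a, rng) for _ in range(n_samples)]
            mean_v[i] = np.mean(v[samples])

        # Since E[v(s')|s,a] = phi(s,a)^T (Psi v), solving the K x K linear
        # system at the anchors recovers the new weight vector.
        w = np.linalg.solve(Phi_anchor, mean_v)

    q = reward + gamma * (phi @ w)
    return q.argmax(axis=1), w  # greedy policy and learned parameters
```

The role of the anchor construction is visible in the inner loop: only K state-action pairs are ever queried, so the sample count scales with the feature dimension K rather than with the number of states. The variance-reduction, monotonicity-preservation, and confidence-bound machinery described in the abstract would wrap this basic loop to reach the Õ(K/(ϵ^2(1-γ)^3)) rate.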


