Stochastic Lipschitz Q-Learning

by   Xu Zhu, et al.

In an episodic Markov Decision Process (MDP) problem, an online algorithm chooses from a set of actions in a sequence of H trials, where H is the episode length, in order to maximize the total payoff of the chosen actions. Q-learning, as the most popular model-free reinforcement learning (RL) algorithm, directly parameterizes and updates value functions without explicitly modeling the environment. Recently, [Jin et al. 2018] studies the sample complexity of Q-learning with finite states and actions. Their algorithm achieves nearly optimal regret, which shows that Q-learning can be made sample efficient. However, MDPs with large discrete states and actions [Silver et al. 2016] or continuous spaces [Mnih et al. 2013] cannot learn efficiently in this way. Hence, it is critical to develop new algorithms to solve this dilemma with provable guarantee on the sample complexity. With this motivation, we propose a novel algorithm that works for MDPs with a more general setting, which has infinitely many states and actions and assumes that the payoff function and transition kernel are Lipschitz continuous. We also provide corresponding theory justification for our algorithm. It achieves the regret Õ(K^d+1/d+2√(H^3)), where K denotes the number of episodes and d denotes the dimension of the joint space. To the best of our knowledge, this is the first analysis in the model-free setting whose established regret matches the lower bound up to a logarithmic factor.


page 1

page 2

page 3

page 4


Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

We present an algorithm based on the Optimism in the Face of Uncertainty...

Is Q-learning Provably Efficient?

Model-free reinforcement learning (RL) algorithms, such as Q-learning, d...

Fast Rates for Maximum Entropy Exploration

We consider the reinforcement learning (RL) setting, in which the agent ...

Almost Optimal Model-Free Reinforcement Learning via Reference-Advantage Decomposition

We study the reinforcement learning problem in the setting of finite-hor...

Lipschitz Bandit Optimization with Improved Efficiency

We consider the Lipschitz bandit optimization problem with an emphasis o...

Layered State Discovery for Incremental Autonomous Exploration

We study the autonomous exploration (AX) problem proposed by Lim Aue...

A Provably Efficient Sample Collection Strategy for Reinforcement Learning

A common assumption in reinforcement learning (RL) is to have access to ...

Please sign up or login with your details

Forgot password? Click here to reset