Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

by Andrea Zanette et al.

We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are infeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm in which exploration is induced by perturbing the least-squares approximation of the action-value function. Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by O(d^2 H^2 √T), where d is the feature dimension, H is the horizon, and T is the total number of steps. To the best of our knowledge, this is the first frequentist regret analysis for randomized exploration with function approximation.
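The exploration mechanism described above — fitting the action-value function by least squares and then perturbing the fit — can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the feature matrix `Phi`, noise scale `sigma`, and ridge parameter `lam` are assumed names, and the perturbation covariance (a scaled inverse of the regularized Gram matrix) follows the standard RLSVI recipe.

```python
import numpy as np

def perturbed_lsq_weights(Phi, targets, sigma=1.0, lam=1.0, rng=None):
    """One RLSVI-style update (illustrative sketch).

    Phi:     (n, d) feature matrix of observed state-action pairs
    targets: (n,)   regression targets (reward + value at next state)
    sigma:   scale of the Gaussian perturbation driving exploration
    lam:     ridge regularization parameter
    """
    rng = np.random.default_rng() if rng is None else rng
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)             # regularized Gram matrix
    w_hat = np.linalg.solve(A, Phi.T @ targets)   # ridge least-squares fit
    cov = sigma**2 * np.linalg.inv(A)             # perturbation covariance
    return rng.multivariate_normal(w_hat, cov)    # randomized weight vector
```

The sampled weights define a randomized action-value estimate Q(s, a) ≈ φ(s, a)ᵀw; acting greedily with respect to it yields exploration, since states and actions with few observations have larger perturbation variance.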




