Randomized Exploration for Reinforcement Learning with General Value Function Approximation

by Haque Ishfaq, et al.

We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class ℱ, our algorithm achieves a worst-case regret bound of Õ(poly(d_E H)√T), where T is the time elapsed, H is the planning horizon, and d_E is the eluder dimension of ℱ. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an Õ(√(d³H³T)) regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
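The core idea in the linear setting can be illustrated with a toy sketch: solve several least-squares regressions, each on targets perturbed by i.i.d. Gaussian noise (with the regularizer perturbed as well), and form an optimistic estimate by taking the maximum over the perturbed solutions. This is only a minimal illustration, not the paper's algorithm: the dimensions, the noise scale `sigma`, and the ensemble size `M` below are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression standing in for one value-iteration step:
# features phi(s, a) in R^d, targets y = reward + estimated next-state value.
d, n, lam, sigma, M = 4, 200, 1.0, 0.5, 10  # M = number of perturbed solves (assumed)

Phi = rng.normal(size=(n, d))                     # features of visited (s, a) pairs
theta_true = rng.normal(size=d)
y = Phi @ theta_true + 0.1 * rng.normal(size=n)   # noisy regression targets

A = Phi.T @ Phi + lam * np.eye(d)                 # regularized Gram matrix

def perturbed_lsq():
    """One perturbed least-squares solve: add i.i.d. Gaussian noise to every
    target and to the regularization (prior) term, then solve as usual."""
    noisy_y = y + sigma * rng.normal(size=n)
    prior_noise = sigma * rng.normal(size=d)      # perturbs the lam * I prior
    return np.linalg.solve(A, Phi.T @ noisy_y + np.sqrt(lam) * prior_noise)

# Ensemble of M independently perturbed estimates.
thetas = np.stack([perturbed_lsq() for _ in range(M)])

def optimistic_value(phi):
    """Optimistic reward sampling in miniature: the max over the M
    perturbed estimates upper-bounds the unperturbed estimate w.h.p."""
    return np.max(thetas @ phi)
```

Taking the max over several independently perturbed solutions is what replaces an explicit UCB-style bonus: each perturbation randomizes the estimate around the least-squares solution, and the maximum is optimistic with constant probability per draw.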


Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

We consider the exploration-exploitation dilemma in finite-horizon reinf...

Online Sub-Sampling for Reinforcement Learning with General Function Approximation

Designing provably efficient algorithms with general function approximat...

Worst-Case Regret Bounds for Exploration via Randomized Value Functions

This paper studies a recent proposal to use randomized value functions t...

Parameterized Indexed Value Function for Efficient Exploration in Reinforcement Learning

It is well known that quantifying uncertainty in the action-value estima...

Randomised Bayesian Least-Squares Policy Iteration

We introduce Bayesian least-squares policy iteration (BLSPI), an off-pol...

Exploration in Model-based Reinforcement Learning with Randomized Reward

Model-based Reinforcement Learning (MBRL) has been widely adapted due to...

Least Square Value Iteration is Robust Under Locally Bounded Misspecification Error

The success of reinforcement learning heavily relies on the function app...
