On Reinforcement Learning Using Monte Carlo Tree Search with Supervised Learning: Non-Asymptotic Analysis

02/14/2019
by Devavrat Shah, et al.

Inspired by the success of AlphaGo Zero (AGZ), which uses Monte Carlo Tree Search (MCTS) with supervised learning via a neural network to learn the optimal policy and value function, this work formally establishes that such an approach finds the optimal policy asymptotically, and provides non-asymptotic guarantees along the way. We focus on infinite-horizon discounted Markov Decision Processes (MDPs). The first step is to establish a property of MCTS claimed in the literature: for any given query state, MCTS provides an approximate value function for that state, given enough simulation steps of the MDP. We provide a non-asymptotic analysis establishing this property by analyzing a non-stationary multi-armed bandit setup. Our proof suggests that MCTS needs to be used with a polynomial rather than logarithmic "upper confidence bound" to achieve its desired performance; interestingly, AGZ chooses such a polynomial bound. Using this as a building block, combined with nearest-neighbor supervised learning, we argue that MCTS acts as a "policy improvement" operator: because it is combined with supervised learning, it has a natural "bootstrapping" property that iteratively improves the value function approximation for all states, despite evaluating at only finitely many states. In effect, we establish that to learn an ε-approximation of the value function in the ℓ_∞ norm, MCTS combined with nearest neighbors requires a number of samples scaling as O(ε^-(d+4)), where d is the dimension of the state space. This is nearly optimal given a minimax lower bound of Ω(ε^-(d+2)).
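
To make the contrast between a polynomial and a logarithmic "upper confidence bound" concrete, the following is a minimal sketch on a simple stationary Bernoulli bandit. It is not the paper's algorithm or setting (the analysis concerns a non-stationary bandit inside MCTS). The function names, the bonus form c * t**a / n**b, and the exponents a = 0.25 and b = 0.5 are illustrative assumptions, not constants taken from the paper.

```python
import math
import random


def polynomial_bonus(t, n, c=1.0, a=0.25, b=0.5):
    """Illustrative polynomial bonus c * t**a / n**b (exponents are assumptions)."""
    return c * t ** a / n ** b


def logarithmic_bonus(t, n, c=1.0):
    """Classical UCB1-style bonus c * sqrt(2 * ln(t) / n), shown for comparison."""
    return c * math.sqrt(2.0 * math.log(t) / n)


def run_bandit(means, steps, bonus, seed=0):
    """Play a Bernoulli bandit with an index rule; return per-arm pull counts."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total reward per arm
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1  # pull every arm once before using the index
        else:
            # Pick the arm with the largest empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i] + bonus(t, counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    arm_means = [0.2, 0.5, 0.8]  # hypothetical arm success probabilities
    print("polynomial :", run_bandit(arm_means, 2000, polynomial_bonus))
    print("logarithmic:", run_bandit(arm_means, 2000, logarithmic_bonus))
```

With a polynomial bonus the exploration term grows faster in t than with a logarithmic one, so suboptimal arms keep being revisited more often; the sketch only illustrates that qualitative difference and says nothing about the paper's rates.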

research
02/25/2020

On Reinforcement Learning for Turn-based Zero-sum Markov Games

We consider the problem of finding Nash equilibrium for two-player turn-...
research
06/08/2020

POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis

Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), ...
research
06/08/2020

Stable Reinforcement Learning with Unbounded State Space

We consider the problem of reinforcement learning (RL) with unbounded st...
research
07/02/2022

Reinforcement Learning Approaches for the Orienteering Problem with Stochastic and Dynamic Release Dates

In this paper, we study a sequential decision making problem faced by e-...
research
12/10/2019

A Finite-Time Analysis of Q-Learning with Neural Network Function Approximation

Q-learning with neural network function approximation (neural Q-learning...
research
02/12/2018

Q-learning with Nearest Neighbors

We consider the problem of model-free reinforcement learning for infinit...
research
11/01/2019

Generalized Mean Estimation in Monte-Carlo Tree Search

We consider Monte-Carlo Tree Search (MCTS) applied to Markov Decision Pr...
