Manipulating the Distributions of Experience used for Self-Play Learning in Expert Iteration

05/30/2020
by Dennis J. N. J. Soemers, et al.

Expert Iteration (ExIt) is an effective framework for learning game-playing policies from self-play. ExIt involves training a policy to mimic the search behaviour of a tree search algorithm, such as Monte-Carlo tree search, and using the trained policy to guide that search. The policy and the tree search can then iteratively improve each other, through experience gathered in self-play between instances of the guided tree search algorithm. This paper outlines three different approaches for manipulating the distribution of data collected from self-play, and the procedure that samples batches for learning updates from the collected data. Firstly, samples in batches are weighted based on the durations of the episodes in which they were originally experienced. Secondly, Prioritized Experience Replay is applied within the ExIt framework, to prioritise sampling experience from which we expect to obtain valuable training signals. Thirdly, a trained exploratory policy is used to diversify the trajectories experienced in self-play. This paper summarises the effects of these manipulations on training performance, evaluated in fourteen different board games. We find major improvements in early training performance in some games, and minor improvements when averaged over all fourteen games.
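As a rough illustration of the first two manipulations, the sketch below computes per-sample weights from episode durations and sampling probabilities from priorities. It is not the authors' implementation. The abstract does not state the direction of the duration weighting, so the inverse-duration rule here is an assumption, as are the proportional prioritisation with exponent `alpha` and all function names.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def duration_weights(episode_lengths: Sequence[int]) -> List[float]:
    """Return one normalised weight per stored sample.

    episode_lengths[i] is the duration of the episode that produced sample i.
    ASSUMPTION: each sample is weighted by 1 / duration, so every episode
    contributes the same total weight however long it lasted.
    """
    raw = [1.0 / n for n in episode_lengths]
    total = sum(raw)
    return [w / total for w in raw]


def priority_probabilities(priorities: Sequence[float],
                           alpha: float = 0.5) -> List[float]:
    """Return sampling probabilities proportional to priority ** alpha.

    ASSUMPTION: proportional prioritisation; alpha = 0 gives uniform
    sampling and alpha = 1 samples in direct proportion to priority.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_batch(buffer: Sequence[T], probabilities: Sequence[float],
                 batch_size: int, rng: random.Random) -> List[T]:
    """Draw a batch (with replacement) according to the given probabilities."""
    return rng.choices(buffer, weights=probabilities, k=batch_size)


# Hypothetical buffer: two samples from a 2-move episode, four from a 4-move one.
lengths = [2, 2, 4, 4, 4, 4]
weights = duration_weights(lengths)
batch = sample_batch(list(range(len(lengths))), weights, 3, random.Random(0))
```

With these hypothetical lengths, the two samples from the short episode together carry the same total weight as the four samples from the long one. The same weights could instead scale each sample's loss term rather than its sampling probability; which of the two the paper uses is not stated in the abstract.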

research
04/07/2019

Policy Gradient Search: Online Planning and Expert Iteration without Search Trees

Monte Carlo Tree Search (MCTS) algorithms perform simulation-based searc...
research
05/14/2019

Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates

In recent years, state-of-the-art game-playing agents often involve poli...
research
02/23/2023

Targeted Search Control in AlphaZero for Effective Policy Improvement

AlphaZero is a self-play reinforcement learning algorithm that achieves ...
research
02/24/2021

Combining Off and On-Policy Training in Model-Based Reinforcement Learning

The combination of deep learning and Monte Carlo Tree Search (MCTS) has ...
research
03/22/2023

CH-Go: Online Go System Based on Chunk Data Storage

The training and running of an online Go system require the support of e...
research
11/28/2022

Learning to design without prior data: Discovering generalizable design strategies using deep learning and tree search

Building an AI agent that can design on its own has been a goal since th...
research
12/22/2020

Learning to Play Imperfect-Information Games by Imitating an Oracle Planner

We consider learning to play multiplayer imperfect-information games wit...
