Double Thompson Sampling in Finite Stochastic Games

02/21/2022 ∙ by Shuqing Shi, et al.
We consider the trade-off between exploration and exploitation in finite discounted Markov Decision Processes (MDPs) whose state transition matrix is unknown. We propose a double Thompson sampling reinforcement learning algorithm (DTS) to solve this kind of problem. The algorithm achieves a total regret bound of π’ͺΜƒ(D√(SAT)) over time horizon T, with S states, A actions, and diameter D. DTS consists of two parts: the first is the traditional part, in which we apply posterior sampling to the transition matrix based on a prior distribution; in the second, we employ a count-based posterior update method to balance the locally optimal action against the long-term optimal action in order to find the globally optimal game value. We further establish a regret bound of π’ͺΜƒ(√(T)/S^2), which is, to our knowledge, the best regret bound so far for finite discounted MDPs. Numerical results demonstrate the efficiency and superiority of our approach.
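To illustrate the first, "traditional" part described above, here is a minimal posterior-sampling (PSRL-style) sketch, not the paper's exact DTS algorithm: maintain Dirichlet counts over the unknown transition matrix, draw one plausible MDP from the posterior, and act greedily in it. The data layout (`counts[s][a]` as next-state visit counts, known mean rewards, the Dirichlet(1) prior) is a simplifying assumption for illustration.

```python
import random

def psrl_step(counts, rewards, gamma=0.95, iters=200):
    """One posterior-sampling step (illustrative sketch, not the paper's DTS).

    counts[s][a] is a list of visit counts to each next state (plus a
    Dirichlet(1) prior); rewards[s][a] is the mean reward, assumed known
    here for simplicity. Returns a greedy policy and value estimates for
    one MDP sampled from the posterior.
    """
    S, A = len(counts), len(counts[0])
    # 1) Sample each row of the transition matrix from its Dirichlet
    #    posterior, via normalized Gamma draws.
    P = [[None] * A for _ in range(S)]
    for s in range(S):
        for a in range(A):
            g = [random.gammavariate(n + 1.0, 1.0) for n in counts[s][a]]
            z = sum(g)
            P[s][a] = [x / z for x in g]
    # 2) Value iteration on the sampled MDP.
    V = [0.0] * S
    for _ in range(iters):
        V = [max(rewards[s][a]
                 + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                 for a in range(A))
             for s in range(S)]
    # 3) Policy that is greedy with respect to the sampled model.
    pi = [max(range(A),
              key=lambda a: rewards[s][a]
              + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a])))
          for s in range(S)]
    return pi, V
```

In a full agent loop, the policy returned by `psrl_step` would be executed, the observed transitions added to `counts`, and the posterior re-sampled; the randomness of the draw is what drives exploration.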


βˆ™ 12/13/2019

Provably Efficient Reinforcement Learning with Aggregated States

We establish that an optimistic variant of Q-learning applied to a finit...
βˆ™ 07/01/2016

Why is Posterior Sampling Better than Optimism for Reinforcement Learning?

Computational results demonstrate that posterior sampling for reinforcem...
βˆ™ 07/22/2019

Convergence Rates of Posterior Distributions in Markov Decision Process

In this paper, we show the convergence rates of posterior distributions ...
βˆ™ 09/28/2022

Optimistic Posterior Sampling for Reinforcement Learning with Few Samples and Tight Guarantees

We consider reinforcement learning in an environment modeled by an episo...
βˆ™ 06/20/2019

Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

We tackle the problem of acting in an unknown finite and discrete Markov...
βˆ™ 03/01/2021

UCB Momentum Q-learning: Correcting the bias without forgetting

We propose UCBMQ, Upper Confidence Bound Momentum Q-learning, a new algo...
βˆ™ 02/22/2022

Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning

In today's economy, it becomes important for Internet platforms to consi...
