Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

06/20/2019
by   Aristide Tossou, et al.

We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number D. An MDP consists of S states and A possible actions per state. Upon choosing an action a_t in state s_t, one receives a real-valued reward r_t and then transitions to a next state s_{t+1}. The reward r_t is generated from a fixed reward distribution depending only on (s_t, a_t), and similarly the next state s_{t+1} is generated from a fixed transition distribution depending only on (s_t, a_t). The objective is to maximize the accumulated rewards after T interactions. In this paper, we consider the case where the reward distributions, the transitions, T, and D are all unknown. We derive the first polynomial-time Bayesian algorithm, BUCRL, that achieves, up to logarithmic factors, a regret (i.e., the difference between the accumulated rewards of the optimal policy and those of our algorithm) of the optimal order Õ(√(DSAT)). Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bounds are very simple and only use elementary functions.
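To make the interaction protocol and the notion of regret concrete, here is a minimal Python sketch. It is not BUCRL and is not taken from the paper: the two-state MDP, its transition table `P`, its Bernoulli reward means `R`, and the helper names `step`, `run`, and `regret` are all hypothetical, chosen only for illustration. The policy `always_1` is the best stationary policy here by construction, since action 1 has the higher reward mean in both states and steers toward the more rewarding state 1.

```python
import random

# Hypothetical toy MDP with S = 2 states and A = 2 actions (illustration only).
# P[s][a] is the next-state distribution; R[s][a] is the mean of a Bernoulli reward.
P = {
    0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
    1: {0: [0.7, 0.3], 1: [0.1, 0.9]},
}
R = {
    0: {0: 0.1, 1: 0.2},
    1: {0: 0.3, 1: 0.9},
}

def step(rng, s, a):
    """Sample (r_t, s_{t+1}) from distributions that depend only on (s_t, a_t)."""
    r = 1.0 if rng.random() < R[s][a] else 0.0
    s_next = rng.choices([0, 1], weights=P[s][a])[0]
    return r, s_next

def run(policy, T, seed=0, s0=0):
    """Accumulated reward of a stationary policy (state -> action) over T steps."""
    rng = random.Random(seed)
    s, total = s0, 0.0
    for _ in range(T):
        r, s = step(rng, s, policy(s))
        total += r
    return total

def regret(reference_policy, learner_policy, T, seed=0):
    """Empirical regret: reference policy's accumulated reward minus the learner's."""
    return run(reference_policy, T, seed) - run(learner_policy, T, seed)

def always_1(s):
    return 1  # best stationary policy in this toy MDP, by construction

def always_0(s):
    return 0  # a deliberately poor stand-in for a learner

if __name__ == "__main__":
    print(regret(always_1, always_0, T=10_000))
```

A learning algorithm would replace `always_0` with a policy that adapts to observed rewards and transitions. The abstract's claim is that BUCRL keeps this difference at order Õ(√(DSAT)) with high probability, whereas a fixed suboptimal policy such as `always_0` incurs regret that grows linearly in T.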

research
03/31/2023

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

We consider online reinforcement learning in episodic Markov decision pr...
research
07/03/2019

Maximum Expected Hitting Cost of a Markov Decision Process and Informativeness of Rewards

We propose a new complexity measure for Markov decision processes (MDP),...
research
05/04/2023

Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

We investigate an infinite-horizon average reward Markov Decision Proces...
research
06/20/2019

Near-optimal Reinforcement Learning using Bayesian Quantiles

We study model-based reinforcement learning in finite communicating Mark...
research
01/21/2022

Under-Approximating Expected Total Rewards in POMDPs

We consider the problem: is the optimal expected total reward to reach a...
research
02/21/2022

Double Thompson Sampling in Finite stochastic Games

We consider the trade-off problem between exploration and exploitation u...
research
05/10/2017

Solving Multi-Objective MDP with Lexicographic Preference: An application to stochastic planning with multiple quantile objective

In most common settings of Markov Decision Process (MDP), an agent evalu...
