A Bayesian Approach to Learning Bandit Structure in Markov Decision Processes

by   Kelly W. Zhang, et al.

In the reinforcement learning literature, there are many algorithms developed for either Contextual Bandit (CB) or Markov Decision Processes (MDP) environments. However, when deploying reinforcement learning algorithms in the real world, even with domain expertise, it is often difficult to know whether it is appropriate to treat a sequential decision making problem as a CB or an MDP. In other words, do actions affect future states, or only the immediate rewards? Making the wrong assumption regarding the nature of the environment can lead to inefficient learning, or even prevent the algorithm from ever learning an optimal policy, even with infinite data. In this work we develop an online algorithm that uses a Bayesian hypothesis testing approach to learn the nature of the environment. Our algorithm allows practitioners to incorporate prior knowledge about whether the environment is that of a CB or an MDP, and effectively interpolate between classical CB and MDP-based algorithms to mitigate against the effects of misspecifying the environment. We perform simulations and demonstrate that in CB settings our algorithm achieves lower regret than MDP-based algorithms, while in non-bandit MDP settings our algorithm is able to learn the optimal policy, often achieving comparable regret to MDP-based algorithms.


page 1

page 2

page 3

page 4


Refined Analysis of FPL for Adversarial Markov Decision Processes

We consider the adversarial Markov Decision Process (MDP) problem, where...

BATS: Best Action Trajectory Stitching

The problem of offline reinforcement learning focuses on learning a good...

Hindsight is Only 50/50: Unsuitability of MDP based Approximate POMDP Solvers for Multi-resolution Information Gathering

Partially Observable Markov Decision Processes (POMDPs) offer an elegant...

Optimal Path Planning of Autonomous Marine Vehicles in Stochastic Dynamic Ocean Flows using a GPU-Accelerated Algorithm

Autonomous marine vehicles play an essential role in many ocean science ...

Receding Horizon Curiosity

Sample-efficient exploration is crucial not only for discovering rewardi...

Markov Decision Process with an External Temporal Process

Most reinforcement learning algorithms treat the context under which the...

Reinforcement Learning with Feedback Graphs

We study episodic reinforcement learning in Markov decision processes wh...

Please sign up or login with your details

Forgot password? Click here to reset