Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

by   Jiawei Huang, et al.

We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies π^O and π^E: π^O ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while π^E ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., π^E=π^O) for the risk-averse users. We individually consider the gap-independent vs. gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce π^E, we can achieve a constant regret for risk-averse users independent of the number of episodes K, which is in sharp contrast to the Ω(log K) regret for any online RL algorithms in the same setting, while the regret of π^O (almost) maintains its online regret optimality and does not need to compromise for the success of π^E.


page 1

page 2

page 3

page 4


Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR

In this paper, we study risk-sensitive Reinforcement Learning (RL), focu...

No-Regret Reinforcement Learning with Value Function Approximation: a Kernel Embedding Approach

We consider the regret minimisation problem in reinforcement learning (R...

Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret

We study risk-sensitive reinforcement learning in episodic Markov decisi...

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment

Multi-armed bandit (MAB) is a classic model for understanding the explor...

Incentivizing Exploration in Linear Bandits under Information Gap

We study the problem of incentivizing exploration for myopic users in li...

Cascaded Gaps: Towards Gap-Dependent Regret for Risk-Sensitive Reinforcement Learning

In this paper, we study gap-dependent regret guarantees for risk-sensiti...

The Externalities of Exploration and How Data Diversity Helps Exploitation

Online learning algorithms, widely used to power search and content opti...

Please sign up or login with your details

Forgot password? Click here to reset