Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

11/25/2020
by Jingfeng Wu, et al.

In this paper, we consider multi-objective reinforcement learning in which the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner; for example, customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where the transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. In the online setting, the agent receives an (adversarial) preference in every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a regret bound of 𝒪(√(min{d,S}· H^3 SAK)), where d is the number of objectives, S is the number of states, A is the number of actions, H is the length of the horizon, and K is the number of episodes. Furthermore, we consider preference-free exploration: the agent first interacts with the environment without any preference being specified, and is then able to accommodate arbitrary preference vectors up to ϵ error. Our proposed algorithm is provably efficient, with a nearly optimal sample complexity of 𝒪(min{d,S}· H^4 SA/ϵ^2).
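To make the reward model concrete, the following is a minimal sketch of the scalarization described in the abstract: a preference vector w is combined with d-dimensional rewards via an inner product, and a policy is then planned for that preference. It is an illustration only, not the paper's algorithm. In particular, it assumes the transition probabilities are known (the paper treats them as unknown), and the names `scalarize`, `plan`, `r`, and `P` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def scalarize(w: Vector, r: List[List[Vector]]) -> List[List[float]]:
    """Return r_w[s][a] = <w, r[s][a]>, where r[s][a] is a d-dimensional reward."""
    return [[sum(wi * ri for wi, ri in zip(w, r_sa)) for r_sa in r_s] for r_s in r]


def plan(
    w: Vector,
    r: List[List[Vector]],
    P: List[List[Vector]],
    H: int,
) -> Tuple[List[float], List[List[int]]]:
    """Finite-horizon value iteration for one preference vector w.

    Assumption (not from the paper): P[s][a][s2] is a known transition
    probability. Returns the step-1 values and a greedy policy, where
    policy[h][s] is the action taken in state s at step h.
    """
    r_w = scalarize(w, r)
    num_states = len(r_w)
    V = [0.0] * num_states  # value after the final step is zero
    policy: List[List[int]] = []
    for _ in range(H):  # backward induction from the last step to the first
        Q = [
            [
                r_w[s][a] + sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(r_w[s]))
            ]
            for s in range(num_states)
        ]
        policy.append([max(range(len(q)), key=q.__getitem__) for q in Q])
        V = [max(q) for q in Q]
    policy.reverse()  # steps were collected last-to-first
    return V, policy


# Toy instance: 2 states, 2 actions, d = 2 objectives, self-loop transitions.
# Action 0 pays objective 0 and action 1 pays objective 1.
r = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
P = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]

values_a, policy_a = plan([1.0, 0.0], r, P, H=3)  # cares only about objective 0
values_b, policy_b = plan([0.2, 0.8], r, P, H=3)  # mostly cares about objective 1
```

In this toy instance, the first preference leads the planner to pick action 0 everywhere and the second leads it to pick action 1, which shows how a new preference changes the optimal policy even though the environment is unchanged.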

research
06/19/2020

On Reward-Free Reinforcement Learning with Linear Function Approximation

Reward-free reinforcement learning (RL) is a framework which is suitable...
research
08/21/2019

A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation

We introduce a new algorithm for multi-objective reinforcement learning ...
research
05/23/2022

Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation

We study human-in-the-loop reinforcement learning (RL) with trajectory p...
research
06/08/2021

Exploration and preference satisfaction trade-off in reward-free learning

Biological agents have meaningful interactions with their environment de...
research
05/10/2017

Solving Multi-Objective MDP with Lexicographic Preference: An application to stochastic planning with multiple quantile objective

In most common settings of Markov Decision Process (MDP), an agent evalu...
research
01/18/2023

Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization

Multi-objective reinforcement learning (MORL) algorithms tackle sequenti...
research
05/15/2019

Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards

We consider an agent who is involved in a Markov decision process and re...
