Reward is enough for convex MDPs

06/01/2021
by Tom Zahavy, et al.

Maximizing a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov Decision Process (MDP) under the Reinforcement Learning (RL) problem formulation. However, not all goals can be captured in this manner. Specifically, it is easy to see that convex MDPs, in which goals are expressed as convex functions of the stationary distribution, cannot in general be formulated in this manner. In this paper, we use Fenchel duality to reformulate the convex MDP problem as a min-max game between a policy player and a cost (negative reward) player, and we propose a meta-algorithm for solving it. We show that the average of the policies produced by an RL agent that maximizes the non-stationary reward produced by the cost player converges to an optimal solution of the convex MDP. Finally, we show that the meta-algorithm unifies several disparate branches of the reinforcement learning literature, such as apprenticeship learning, variational intrinsic control, constrained MDPs, and pure exploration, in a single framework.
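The game described in the abstract can be illustrated with a small tabular sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: the discounted state-action occupancy stands in for the stationary distribution, value iteration plays the role of the RL agent (a best-response policy player), the cost player uses a follow-the-leader rule that sets the cost to the gradient of the objective at the running average occupancy, and the objective is a quadratic distance to the uniform distribution. The function names (`occupancy`, `best_response`, `solve_convex_mdp`) and the two-state MDP are hypothetical.

```python
import numpy as np

def occupancy(P, pi, mu0, gamma):
    """Discounted state-action occupancy d(s, a) of a deterministic policy pi."""
    S, A, _ = P.shape
    P_pi = P[np.arange(S), pi]  # (S, S): row s is P(. | s, pi[s])
    # State occupancy solves mu = (1 - gamma) * mu0 + gamma * P_pi^T mu.
    mu = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    d = np.zeros((S, A))
    d[np.arange(S), pi] = mu
    return d

def best_response(P, reward, gamma, iters=500):
    """Policy player: greedy policy from value iteration on a fixed reward."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = reward + gamma * (P @ V)  # (S, A) action values
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def solve_convex_mdp(P, mu0, grad_f, gamma=0.9, K=200):
    """Alternate cost player and policy player; return the averaged occupancy."""
    S, A, _ = P.shape
    d_bar = np.zeros((S, A))
    cost = np.zeros((S, A))  # the first cost is arbitrary (assumed zero here)
    for k in range(1, K + 1):
        pi = best_response(P, -cost, gamma)  # reward is the negative cost
        d_k = occupancy(P, pi, mu0, gamma)
        d_bar += (d_k - d_bar) / k           # running average of occupancies
        cost = grad_f(d_bar)                 # assumed follow-the-leader cost update
    return d_bar

# Hypothetical two-state MDP: action 0 stays, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
mu0 = np.array([1.0, 0.0])

# Assumed convex objective: squared distance to the uniform occupancy.
u = np.full((2, 2), 0.25)
f = lambda d: 0.5 * np.sum((d - u) ** 2)
grad_f = lambda d: d - u

d_first = occupancy(P, np.array([0, 0]), mu0, 0.9)  # a single stationary policy
d_bar = solve_convex_mdp(P, mu0, grad_f)
print(f(d_first), f(d_bar))
```

In this sketch the reward handed to the agent changes at every iteration, and it is the average of the resulting occupancies, not any single policy, that is driven toward a minimizer of the convex objective. With these particular choices the loop behaves like a Frank-Wolfe iteration over the set of occupancy measures.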


research
12/05/2019

Reinforcement Learning with Non-Markovian Rewards

The standard RL world model is that of a Markov Decision Process (MDP). ...
research
08/03/2023

Aligning Agent Policy with Externalities: Reward Design via Bilevel RL

In reinforcement learning (RL), a reward function is often assumed at th...
research
10/06/2022

Learning Algorithms for Intelligent Agents and Mechanisms

In this thesis, we research learning algorithms for optimal decision mak...
research
10/29/2019

Constrained Reinforcement Learning Has Zero Duality Gap

Autonomous agents must often deal with conflicting requirements, such as...
research
05/17/2020

Optimizing for the Future in Non-Stationary MDPs

Most reinforcement learning methods are based upon the key assumption th...
research
10/21/2017

Insulin Regimen ML-based control for T2DM patients

We model individual T2DM patient blood glucose level (BGL) by stochasti...
research
12/24/2021

Multi-Provider NFV Network Service Delegation via Average Reward Reinforcement Learning

In multi-provider 5G/6G networks, service delegation enables administrat...
