The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

01/10/2022
by   Alexander Pan, et al.
0

Reward hacking – where RL agents exploit gaps in misspecified reward functions – has been widely observed, but not yet systematically studied. To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. More capable agents often exploit reward misspecifications, achieving higher proxy reward and lower true reward than less capable agents. Moreover, we find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward. Such phase transitions pose challenges to monitoring the safety of ML systems. To address this, we propose an anomaly detection task for aberrant policies and offer several baseline detectors.

READ FULL TEXT
research
02/27/2023

Reward Design with Language Models

Reward design in reinforcement learning (RL) is challenging since specif...
research
06/11/2020

Exploration by Maximizing Rényi Entropy for Zero-Shot Meta RL

Exploring the transition dynamics is essential to the success of reinfor...
research
11/01/2019

Positive-Unlabeled Reward Learning

Learning reward functions from data is a promising path towards achievin...
research
04/02/2022

Safe Reinforcement Learning via Shielding for POMDPs

Reinforcement learning (RL) in safety-critical environments requires an ...
research
01/25/2022

Dynamics-Aware Comparison of Learned Reward Functions

The ability to learn reward functions plays an important role in enablin...
research
11/17/2020

Avoiding Tampering Incentives in Deep RL via Decoupled Approval

How can we design agents that pursue a given objective when all feedback...
research
09/27/2022

Defining and Characterizing Reward Hacking

We provide the first formal definition of reward hacking, a phenomenon w...

Please sign up or login with your details

Forgot password? Click here to reset