Performance of Bounded-Rational Agents With the Ability to Self-Modify

11/12/2020
by Jakub Tětek, et al.

Self-modification of agents embedded in complex environments is hard to avoid, whether it happens via direct means (e.g. own code modification) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). While it has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances will work towards the same goals, it is not clear whether this also applies in non-dualistic scenarios, where the agent is embedded in the environment. The problem of self-modification safety is raised by Bostrom in Superintelligence (2014) in the context of safe AGI deployment. In contrast to Everitt et al. (2016), who formally show that providing an option to self-modify is harmless for perfectly rational agents, we show that for agents with bounded rationality, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of imperfections in the agent's rationality (1-4 below). We also discuss the model assumptions and the wider space of problems and framings. Specifically, we introduce several types of bounded-rational agents, which either (1) do not always choose the optimal action, (2) are not perfectly aligned with human values, (3) have an inaccurate model of the environment, or (4) use the wrong temporal discounting factor. We show that while in cases (2)-(4) the misalignment caused by the agent's imperfection does not worsen over time, in case (1) it may grow exponentially.
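
As a rough intuition for the contrast between case (1) and cases (2)-(4), the following is a minimal toy simulation, not the model analyzed in the paper; the error rate eps, the corruption factor, and the 0/1 reward are purely illustrative assumptions. An agent that picks a suboptimal action with a fixed probability eps loses reward at a constant rate, but if a suboptimal action can also corrupt the decision procedure itself (a crude stand-in for harmful self-modification), the per-step error rate compounds and performance collapses.

import random

def run(eps, horizon, can_self_modify, corruption=1.5, seed=0):
    """Total reward over `horizon` steps; the optimal action yields reward 1."""
    rng = random.Random(seed)
    error_rate = eps
    total = 0.0
    for _ in range(horizon):
        if rng.random() < error_rate:
            # Suboptimal action: no reward this step. If self-modification is
            # possible, the mistake may also degrade the agent's own decision
            # procedure, making future errors more likely.
            if can_self_modify:
                error_rate = min(1.0, error_rate * corruption)
        else:
            total += 1.0
    return total

if __name__ == "__main__":
    for flag in (False, True):
        rewards = [run(eps=0.05, horizon=200, can_self_modify=flag, seed=s)
                   for s in range(100)]
        label = "self-modifying" if flag else "fixed policy  "
        print(label, "mean reward:", round(sum(rewards) / len(rewards), 1), "/ 200")

In this toy setup the fixed-policy agent loses only a small, constant fraction of the achievable reward, while the self-modifying one drifts toward earning nothing within the horizon, loosely mirroring the exponential deterioration the authors describe for case (1).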
