Corrigibility with Utility Preservation

by   Koen Holtman, et al.

Corrigibility is a safety property for artificially intelligent agents. A corrigible agent will not resist attempts by authorized parties to alter the goals and constraints that were encoded in the agent when it was first started. This paper shows how to construct a safety layer that adds corrigibility to arbitrarily advanced utility maximizing agents, including possible future agents with Artificial General Intelligence (AGI). The layer counter-acts the emergent incentive of advanced agents to resist such alteration. A detailed model for agents which can reason about preserving their utility function is developed, and used to prove that the corrigibility layer works as intended in a large set of non-hostile universes. The corrigible agents have an emergent incentive to protect key elements of their corrigibility layer. However, hostile universes may contain forces strong enough to break safety features. Some open problems related to graceful degradation when an agent is successfully attacked are identified. The results in this paper were obtained by concurrently developing an AGI agent simulator, an agent model, and proofs. The simulator is available under an open source license. The paper contains simulation results which illustrate the safety related properties of corrigible AGI agents in detail.


page 1

page 2

page 3

page 4


AGI Agent Safety by Iteratively Improving the Utility Function

While it is still unclear if agents with Artificial General Intelligence...

Of Models and Tin Men – a behavioural economics study of principal-agent problems in AI alignment using large-language models

AI Alignment is often presented as an interaction between a single desig...

Counterfactual Planning in AGI Systems

We present counterfactual planning as a design approach for creating a r...

NASimEmu: Network Attack Simulator Emulator for Training Agents Generalizing to Novel Scenarios

Current frameworks for training offensive penetration testing agents wit...

Performance of Bounded-Rational Agents With the Ability to Self-Modify

Self-modification of agents embedded in complex environments is hard to ...

Ontological Crises in Artificial Agents' Value Systems

Decision-theoretic agents predict and evaluate the results of their acti...

Break and Make: Interactive Structural Understanding Using LEGO Bricks

Visual understanding of geometric structures with complex spatial relati...

Please sign up or login with your details

Forgot password? Click here to reset