Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning

by Jakob N. Foerster, et al.

When observing the actions of others, humans carry out inferences about why the others acted as they did, and what this implies about their view of the world. Humans also use the fact that their actions will be interpreted in this manner when observed by others, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. Together with the public belief, this Bayesian update effectively defines a new Markov decision process, the public belief MDP, in which the action space consists of deterministic partial policies, parameterised by deep neural networks, that can be sampled for a given public state. It exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over partial policies mapping private information into environment actions. The Bayesian update is also closely related to the theory of mind reasoning that humans carry out when observing others' actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms traditional policy gradient methods. We then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where in the two-player setting the method surpasses all previously published learning and hand-coded approaches.
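The core of BAD is the approximate Bayesian update of the public belief: after observing an agent's action, all agents reweight the prior belief over that agent's private information by the likelihood of the observed action under the sampled partial policy. The sketch below is a minimal illustration of this update for a discrete private-state space, not the paper's implementation; the function name `update_public_belief` and the tabular representation of the partial policy are assumptions for exposition.

```python
import numpy as np

def update_public_belief(belief, partial_policy, action):
    """Bayesian update of a public belief over private states.

    belief:         prior P(private_state) as a 1-D array.
    partial_policy: 2-D array, partial_policy[s, a] = P(action a | private state s),
                    i.e. the mapping from private information to environment actions.
    action:         index of the action the acting agent was observed to take.

    Returns the posterior P(private_state | observed action): the prior is
    weighted by the likelihood of the observed action and renormalised.
    """
    likelihood = partial_policy[:, action]   # P(action | private state)
    posterior = belief * likelihood          # unnormalised Bayes rule
    total = posterior.sum()
    if total == 0.0:
        raise ValueError("Observed action has zero probability under the belief")
    return posterior / total

# Example: two private states and a deterministic partial policy. Observing
# action 1 reveals the private state exactly, so the posterior collapses.
belief = np.array([0.5, 0.5])
partial_policy = np.array([[1.0, 0.0],   # state 0 -> action 0
                           [0.0, 1.0]])  # state 1 -> action 1
print(update_public_belief(belief, partial_policy, action=1))  # -> [0. 1.]
```

With a deterministic partial policy, as in the example, the observed action can carry maximal information about the private state; this is the mechanism by which agents in the public belief MDP learn to act informatively.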



