How memory architecture affects performance and learning in simple POMDPs

06/16/2021
by Mario Geiger, et al.

Reinforcement learning becomes much harder when the agent's observations are partial or noisy. This setting corresponds to a partially observable Markov decision process (POMDP). One strategy for achieving good performance in POMDPs is to endow the agent with a finite memory whose update is governed by the policy. However, policy optimization is non-convex in that case and can lead to poor training performance from random initialization. Performance can be improved empirically by constraining the memory architecture, thereby sacrificing optimality to facilitate training. Here we study this trade-off in the two-arm bandit problem and compare two extreme cases: (i) a random access memory, where any transition between M memory states is allowed, and (ii) a fixed memory, where the agent can access its last m actions and rewards. For (i), the probability q of playing the worst arm is known to be exponentially small in M for the optimal policy. Our main result is that similar performance can be reached for (ii) as well, despite the simplicity of the memory architecture: using a conjecture on Gray-ordered binary necklaces, we find policies for which q is exponentially small in 2^m, i.e. q ∼ α^(2^m) for some α < 1. Interestingly, we observe empirically that training from random initialization leads to very poor results for (i) and significantly better results for (ii).
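To make the fixed-memory setting (ii) concrete, here is a minimal Python sketch for the smallest case, m = 1, where the agent remembers only its last action and last reward. It is an illustration under stated assumptions, not the construction from the paper: the arms are assumed to be Bernoulli with fixed success probabilities, the function name `worst_arm_frequency` and the policy table `WIN_STAY_LOSE_SHIFT` are hypothetical, and win-stay-lose-shift is only a familiar textbook policy chosen for the example. The sketch computes the long-run probability q of playing the worst arm by iterating the Markov chain over memory states.

```python
import itertools


def worst_arm_frequency(policy, p_reward, n_iter=1000):
    """Long-run probability q of playing the worse arm with an m=1 memory.

    policy[(a, r)] is the probability of playing arm 0 when the last action
    was a and the last reward was r. p_reward[a] is the (assumed Bernoulli)
    success probability of arm a.
    """
    # Memory states: (last action, last reward).
    states = list(itertools.product((0, 1), (0, 1)))
    dist = {s: 1.0 / len(states) for s in states}  # start from uniform
    for _ in range(n_iter):
        new = {s: 0.0 for s in states}
        for s, mass in dist.items():
            for a in (0, 1):
                p_a = policy[s] if a == 0 else 1.0 - policy[s]
                for r in (0, 1):
                    p_r = p_reward[a] if r == 1 else 1.0 - p_reward[a]
                    # The new memory state is the (action, reward) just seen.
                    new[(a, r)] += mass * p_a * p_r
        dist = new
    worst = min((0, 1), key=lambda a: p_reward[a])
    return sum(mass for (a, _), mass in dist.items() if a == worst)


# Illustrative policy (an assumption, not taken from the paper):
# repeat the last action after a reward, switch arms after no reward.
WIN_STAY_LOSE_SHIFT = {
    (0, 1): 1.0,  # played arm 0 and won: stay on arm 0
    (0, 0): 0.0,  # played arm 0 and lost: shift to arm 1
    (1, 1): 0.0,  # played arm 1 and won: stay on arm 1
    (1, 0): 1.0,  # played arm 1 and lost: shift to arm 0
}

if __name__ == "__main__":
    q = worst_arm_frequency(WIN_STAY_LOSE_SHIFT, p_reward=(0.7, 0.4))
    print(f"q = {q:.4f}")
```

For this particular policy the chain over actions has a closed-form stationary distribution, q = (1 - p_best) / ((1 - p_best) + (1 - p_worst)), which gives a quick sanity check on the iteration. Extending the memory state to the last m (action, reward) pairs enlarges the state space to 4^m entries but leaves the computation unchanged.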


