Regret Minimization Experience Replay

05/15/2021
by Zhenghai Xue, et al.

Experience replay is widely used in deep off-policy reinforcement learning (RL) algorithms: it stores previously collected samples for later reuse. Prioritized sampling is a promising technique for making better use of these samples, but previous prioritization methods based on temporal-difference (TD) error are largely heuristic and diverge from the objective of RL. In this work, we theoretically analyze the optimal prioritization strategy that minimizes the regret of the RL policy. Our theory suggests that data with higher TD error, greater on-policiness, and more corrective feedback should be assigned higher weights during sampling. Based on this theory, we propose two practical algorithms: RM-DisCor, a general algorithm, and RM-TCE, a more efficient variant that relies on the temporal ordering of states. Both improve the performance of off-policy RL algorithms on challenging benchmarks, including MuJoCo, Atari, and Meta-World.
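To make the prioritization idea concrete, here is a minimal sketch of priority-based replay sampling: per-transition scores for the three factors the abstract names (TD error, on-policiness, corrective feedback) are combined into a sampling distribution over the buffer. The function names, the additive combination, and the softmax weighting are illustrative assumptions, not the authors' RM-DisCor or RM-TCE algorithms.

```python
import numpy as np

def replay_priorities(td_errors, on_policiness, corrective_feedback, temperature=1.0):
    """Turn per-transition scores into a sampling distribution.

    Hypothetical combination: higher TD error, higher on-policiness
    (e.g. a likelihood ratio of the current policy vs. the behavior
    policy), and more corrective feedback all raise a transition's
    sampling probability. A softmax with a temperature maps the
    combined score to a proper probability vector.
    """
    score = td_errors + on_policiness + corrective_feedback
    logits = score / temperature
    logits -= logits.max()              # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def sample_batch(buffer_size, probs, batch_size, rng):
    # Draw transition indices in proportion to their priorities.
    return rng.choice(buffer_size, size=batch_size, p=probs)

rng = np.random.default_rng(0)
n = 8  # toy buffer of 8 transitions
td = rng.uniform(0.0, 1.0, n)
onp = rng.uniform(0.0, 1.0, n)
cf = rng.uniform(0.0, 1.0, n)
p = replay_priorities(td, onp, cf)
idx = sample_batch(n, p, batch_size=4, rng=rng)
```

In a full agent, the sampled indices would select transitions for the TD update, and per-sample importance weights would correct for the non-uniform sampling, as in standard prioritized replay.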

Related research

02/22/2021: Stratified Experience Replay: Correcting Multiplicity Bias in Off-Policy Reinforcement Learning
Deep Reinforcement Learning (RL) methods rely on experience replay to ap...

02/21/2023: MAC-PO: Multi-Agent Experience Replay via Collective Priority Optimization
Experience replay is crucial for off-policy reinforcement learning (RL) ...

07/13/2020: Revisiting Fundamentals of Experience Replay
Experience replay is central to off-policy algorithms in deep reinforcem...

07/14/2020: Learning to Sample with Local and Global Contexts in Experience Replay Buffer
Experience replay, which enables the agents to remember and reuse experi...

06/08/2023: Offline Prioritized Experience Replay
Offline reinforcement learning (RL) is challenged by the distributional ...

08/25/2022: Variance Reduction based Experience Replay for Policy Optimization
For reinforcement learning on complex stochastic systems where many fact...

12/08/2021: Convergence Results For Q-Learning With Experience Replay
A commonly used heuristic in RL is experience replay (e.g. <cit.>), in w...
