Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with On-Policy Experience

09/24/2021
by Chayan Banerjee, et al.

Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm based on entropy regularization. SAC trains a policy by maximizing a trade-off between expected return and entropy (randomness in the policy). It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks, outperforming prior on-policy and off-policy methods. SAC learns in an off-policy fashion: transitions are sampled uniformly from a buffer of past experience and used to update the parameters of the policy and value-function networks. We propose modifications to SAC that boost its performance and make it more sample efficient. First, we introduce a new prioritization scheme for selecting better samples from the experience replay buffer. Second, we train the policy and value-function networks on a mixture of the prioritized off-policy data and the latest on-policy data. We compare our approach with vanilla SAC and several recent SAC variants and show that it outperforms these baselines, being more stable and sample efficient across a number of continuous control tasks in MuJoCo environments.
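The central idea described in the abstract, drawing each training minibatch as a mixture of prioritized off-policy samples and the most recent on-policy transitions, can be illustrated with a short sketch. Everything below is an assumption made for illustration only: the buffer class, the use of absolute TD error as the priority signal, the proportional sampling rule, and the on_policy_ratio parameter are not taken from the paper.

```python
import numpy as np

class MixedReplayBuffer:
    """Illustrative replay buffer that returns minibatches mixing
    prioritized off-policy samples with the latest on-policy transitions.

    Sketch only: the priority signal (absolute TD error), the proportional
    sampling rule, and the on-policy fraction are assumptions, not the
    paper's exact scheme.
    """

    def __init__(self, capacity, on_policy_ratio=0.25, eps=1e-6):
        self.capacity = capacity
        self.on_policy_ratio = on_policy_ratio  # assumed mixing fraction
        self.eps = eps
        self.storage = []      # transitions: (s, a, r, s_next, done)
        self.priorities = []   # one priority per stored transition
        self.latest = []       # transitions from the current rollout

    def add(self, transition, td_error=None):
        # New transitions get the maximum existing priority so they are
        # sampled at least once before their TD error is known.
        priority = abs(td_error) if td_error is not None else \
            (max(self.priorities) if self.priorities else 1.0)
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
            self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append(priority + self.eps)
        self.latest.append(transition)

    def end_rollout(self):
        # Call after each environment rollout; clears the on-policy slice.
        self.latest = []

    def sample(self, batch_size):
        # Reserve part of the batch for the most recent on-policy data.
        n_on = min(int(batch_size * self.on_policy_ratio), len(self.latest))
        n_off = batch_size - n_on
        # Prioritized (proportional) sampling of off-policy transitions.
        probs = np.asarray(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.storage), size=n_off, p=probs)
        off_policy = [self.storage[i] for i in idx]
        # Always include the newest on-policy transitions.
        on_policy = self.latest[-n_on:] if n_on > 0 else []
        return off_policy + on_policy
```

Under this sketch, a SAC-style update would simply draw its minibatch from sample(batch_size) instead of sampling uniformly from the replay buffer.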

Related research

06/10/2019
Boosting Soft Actor-Critic: Emphasizing Recent Experience without Forgetting the Past
Soft Actor-Critic (SAC) is an off-policy actor-critic deep reinforcement...

06/02/2020
Diversity Actor-Critic: Sample-Aware Entropy Regularization for Sample-Efficient Exploration
Policy entropy regularization is commonly used for better exploration in...

12/06/2022
Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots
Many real-world continuous control problems are in the dilemma of weighi...

06/23/2020
Experience Replay with Likelihood-free Importance Weights
The use of past experiences to accelerate temporal difference (TD) learn...

10/05/2019
Towards Simplicity in Deep Reinforcement Learning: Streamlined Off-Policy Learning
The field of Deep Reinforcement Learning (DRL) has recently seen a surge...

02/07/2020
Off-policy Maximum Entropy Reinforcement Learning: Soft Actor-Critic with Advantage Weighted Mixture Policy (SAC-AWMP)
The optimal policy of a reinforcement learning problem is often disconti...

07/29/2020
Learning Object-conditioned Exploration using Distributed Soft Actor Critic
Object navigation is defined as navigating to an object of a given label...
