Expert Q-learning: Deep Q-learning With State Values From Expert Examples

by Li Meng, et al.

We propose a novel algorithm named Expert Q-learning. Expert Q-learning is inspired by Dueling Q-learning and aims to incorporate ideas from semi-supervised learning into reinforcement learning by splitting Q-values into state values and action advantages. Unlike Generative Adversarial Imitation Learning and Deep Q-Learning from Demonstrations, the offline expert we use only predicts the value of a state as one of {-1, 0, 1}, indicating whether the state is bad, neutral, or good. In addition to the Q-network, we designed an expert network, which is updated after each regular offline minibatch update whenever the expert example buffer is not empty. The Q-network plays the role of the advantage function only during the update. Our algorithm also keeps asynchronous copies of the Q-network and the expert network, and predicts the target values in the same manner as Double Q-learning. On the game of Othello, we compared our algorithm with the state-of-the-art Q-learning algorithm, a combination of Double Q-learning and Dueling Q-learning. The results showed that Expert Q-learning was useful and more resistant to the overestimation bias of Q-learning. The baseline Q-learning algorithm exhibited unstable and suboptimal behavior, especially when playing against a stochastic player, whereas Expert Q-learning performed more robustly and achieved higher scores. Expert Q-learning without examples also obtained better results than the baseline algorithm when trained and tested against a fixed player. On the other hand, Expert Q-learning without examples could not win against the baseline Q-learning algorithm in direct game competitions, even though it also reduced the overestimation bias.
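The abstract does not give the update equations, so the following is only a minimal sketch of how a state value from an expert network and advantages from a Q-network might be combined into a Double Q-learning style target. The mean-subtracted aggregation is borrowed from Dueling Q-learning and is an assumption here, as are the function names (`dueling_q`, `double_q_target`), the discount factor, and all numeric values; none of them come from the paper.

```python
from typing import List


def dueling_q(state_value: float, advantages: List[float]) -> List[float]:
    """Combine a state value with per-action advantages (dueling-style).

    Assumption: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_q_target(
    reward: float,
    gamma: float,
    done: bool,
    v_next_target: float,
    adv_next_online: List[float],
    adv_next_target: List[float],
) -> float:
    """Double Q-learning style target with a separate state-value estimate.

    The online advantages choose the next action; the target copies
    (state value and advantages) evaluate it.
    """
    if done:
        return reward
    # V(s') is constant across actions, so the argmax over Q equals
    # the argmax over the online advantages.
    best_action = max(range(len(adv_next_online)), key=lambda i: adv_next_online[i])
    q_next = dueling_q(v_next_target, adv_next_target)
    return reward + gamma * q_next[best_action]


# Hypothetical numbers: the expert's target copy rates the next state as good (+1).
target = double_q_target(
    reward=0.0,
    gamma=1.0,
    done=False,
    v_next_target=1.0,
    adv_next_online=[0.25, 0.5, -0.125],
    adv_next_target=[0.25, 0.5, -0.75],
)
print(target)
```

Because the state value comes from a separate estimate bounded by the expert labels, an inflated advantage for a single action shifts the target less than it would if the whole Q-value were learned from bootstrapped targets. This is one plausible reading of the reduced overestimation bias reported above, not a claim taken from the paper.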




