Expected Sarsa(λ) with Control Variate for Variance Reduction
Off-policy learning is powerful for reinforcement learning, but the high variance of off-policy evaluation is a critical challenge that can drive off-policy learning with function approximation into uncontrolled instability. In this paper, to reduce this variance, we introduce the control variate technique into Expected Sarsa(λ) and propose the tabular ES(λ)-CV algorithm. We prove that, given a suitable estimator of the value function, the proposed ES(λ)-CV enjoys lower variance than Expected Sarsa(λ). Furthermore, to extend ES(λ)-CV to a convergent algorithm with linear function approximation, we propose the GES(λ) algorithm under a convex-concave saddle-point formulation. We prove that GES(λ) achieves a convergence rate of O(1/T), matching or outperforming several state-of-the-art gradient-based algorithms while using a more relaxed step-size condition. Numerical experiments show that the proposed algorithm is stable and converges faster, with lower variance, than several state-of-the-art gradient-based TD learning algorithms: GQ(λ), GTB(λ), and ABQ(ζ).
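To illustrate the control-variate idea the abstract refers to, the sketch below computes a control-variate-corrected off-policy λ-return for a tabular Expected Sarsa(λ) update. This is a minimal illustration based on the standard per-decision control-variate λ-return from the off-policy learning literature, not necessarily the exact ES(λ)-CV operator of the paper; the function names and the episodic, forward-view formulation are assumptions for exposition.

```python
import numpy as np


def es_lambda_cv_returns(trajectory, Q, pi, b, gamma=0.99, lam=0.9):
    """Control-variate-corrected off-policy lambda-returns for one episode.

    trajectory: list of (s, a, r, s_next) tuples generated by behavior policy b.
    Q:  array [n_states, n_actions] of current action-value estimates
        (terminal states are assumed to have all-zero rows).
    pi: array [n_states, n_actions] of target-policy probabilities.
    b:  array [n_states, n_actions] of behavior-policy probabilities.
    """
    G = 0.0
    returns = []
    for t, (s, a, r, s_next) in reversed(list(enumerate(trajectory))):
        v_bar = np.dot(pi[s_next], Q[s_next])  # expected value under pi
        if t + 1 < len(trajectory):
            s1, a1, _, _ = trajectory[t + 1]
            rho1 = pi[s1, a1] / b[s1, a1]      # per-decision importance ratio
            # Control-variate form: relative to the plain per-decision return
            # (1 - lam) * v_bar + lam * rho1 * G, this adds
            # lam * (v_bar - rho1 * Q[s1, a1]), a term with zero expectation
            # under the behavior policy that cancels sampling noise.
            G = r + gamma * (v_bar + lam * rho1 * (G - Q[s1, a1]))
        else:
            G = r + gamma * v_bar              # bootstrap at the episode end
        returns.append((s, a, G))
    return list(reversed(returns))


def es_lambda_cv_update(trajectory, Q, pi, b, alpha=0.1, gamma=0.99, lam=0.9):
    """One forward-view tabular update of Q toward the corrected returns."""
    for s, a, G in es_lambda_cv_returns(trajectory, Q, pi, b, gamma, lam):
        Q[s, a] += alpha * (G - Q[s, a])
    return Q
```

Because the added correction term has zero mean under the behavior policy, it leaves the update's expectation unchanged while shrinking the variance contributed by the sampled importance ratios, which is the variance-reduction property the paper formalizes for ES(λ)-CV.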