Fast multi-agent temporal-difference learning via homotopy stochastic primal-dual optimization

by Dongsheng Ding, et al.

We consider a distributed multi-agent policy evaluation problem in reinforcement learning. In our setup, a group of agents with jointly observed states and private local actions and rewards collaborates to learn the value function of a given policy. When the dimension of the state-action space is large, temporal-difference learning with linear function approximation is widely used. Under the assumption that the samples are i.i.d., the best-known convergence rate for multi-agent temporal-difference learning is O(1/√T) for minimizing the mean square projected Bellman error. In this paper, we formulate temporal-difference learning as a distributed stochastic saddle point problem and propose a new homotopy primal-dual algorithm that adaptively restarts the gradient update from the average of previous iterations. We prove that our algorithm enjoys an O(1/T) convergence rate up to logarithmic factors of T, thereby significantly improving the previously known convergence results for multi-agent temporal-difference learning. Furthermore, since our result explicitly takes into account the Markovian nature of the sampling in policy evaluation, it addresses a broader class of problems than the commonly used i.i.d. sampling scenario. From a stochastic optimization perspective, to the best of our knowledge, the proposed homotopy primal-dual algorithm is the first to achieve an O(1/T) convergence rate for distributed stochastic saddle point problems.
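The restart idea in the abstract can be illustrated with a minimal sketch. This is not the paper's distributed algorithm: it is a single-agent, centralized simplification with no communication between agents, and it assumes the standard saddle-point form of the mean square projected Bellman error, min_θ max_w wᵀ(Aθ − b) − ½ wᵀCw. The stage schedule (halve the step size and double the stage length after each restart) is an assumption chosen for illustration, and the names `homotopy_primal_dual` and `sample` are hypothetical. In a temporal-difference setting one would typically form the estimates from a transition (φ, r, φ') as A_t = φ(φ − γφ')ᵀ, b_t = rφ, C_t = φφᵀ.

```python
import numpy as np


def homotopy_primal_dual(sample, dim, num_stages=3, stage_len=50, step=0.5):
    """Restarted stochastic primal-dual sketch (single agent, illustrative).

    Targets the saddle point of
        min_theta max_w  w^T (A theta - b) - 0.5 * w^T C w.

    `sample` is a hypothetical callable returning one stochastic
    estimate (A_t, b_t, C_t) of (A, b, C).
    """
    theta = np.zeros(dim)  # primal variable (value-function weights)
    w = np.zeros(dim)      # dual variable
    for _ in range(num_stages):
        theta_sum = np.zeros(dim)
        w_sum = np.zeros(dim)
        for _ in range(stage_len):
            A_t, b_t, C_t = sample()
            # Both gradients are evaluated at the current (theta, w).
            grad_theta = A_t.T @ w
            grad_w = A_t @ theta - b_t - C_t @ w
            theta = theta - step * grad_theta  # descent step in theta
            w = w + step * grad_w              # ascent step in w
            theta_sum += theta
            w_sum += w
        # Restart the next stage from the average of this stage's iterates.
        theta = theta_sum / stage_len
        w = w_sum / stage_len
        # Assumed schedule: halve the step size, double the stage length.
        step /= 2.0
        stage_len *= 2
    return theta, w


if __name__ == "__main__":
    # Noise-free toy problem with A = C = I, whose solution is theta = b.
    b = np.array([1.0, 2.0])
    eye = np.eye(2)
    theta, w = homotopy_primal_dual(lambda: (eye, b, eye), dim=2)
    print(theta)
```

On the noise-free toy problem the averaged iterate should approach θ = b; with noisy samples, the shrinking step size across stages is what damps the variance while each restart keeps the progress made so far.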




Multi-Agent Reinforcement Learning via Double Averaging Primal-Dual Optimization

Despite the success of single-agent reinforcement learning, multi-agent ...

Voting-Based Multi-Agent Reinforcement Learning

The recent success of single-agent reinforcement learning (RL) encourage...

A Law of Iterated Logarithm for Multi-Agent Reinforcement Learning

In Multi-Agent Reinforcement Learning (MARL), multiple agents interact w...

A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound

Policy evaluation in reinforcement learning is often conducted using two...

Online Off-policy Prediction

This paper investigates the problem of online prediction learning, where...

Simple and optimal methods for stochastic variational inequalities, II: Markovian noise and policy evaluation in reinforcement learning

The focus of this paper is on stochastic variational inequalities (VI) u...

Distributed TD(0) with Almost No Communication

We provide a new non-asymptotic analysis of distributed TD(0) with linea...
