D^2: Decentralized Training over Decentralized Data

by Hanlin Tang, et al.
ETH Zurich
University of Rochester
Michigan State University

When training a machine learning model with multiple workers, each of which collects data from its own source, it is most useful when the data collected by different workers are unique and different. Ironically, recent analyses of decentralized parallel stochastic gradient descent (D-PSGD) rely on the assumption that the data hosted on different workers are not too different. In this paper, we ask: Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers? We present D^2, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, "decentralized" data). The core of D^2 is a variance reduction extension of the standard D-PSGD algorithm, which improves the convergence rate from O(σ/√(nT) + (nζ^2)^{1/3}/T^{2/3}) to O(σ/√(nT)), where ζ^2 denotes the variance among data on different workers. As a result, D^2 is robust to data variance among workers. We empirically evaluate D^2 on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D^2 significantly outperforms D-PSGD.
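To make the variance reduction idea concrete, here is a minimal NumPy sketch of the D^2-style update on a toy problem. It assumes the update form reported for D^2 (after an initial plain D-PSGD step, each worker combines its current and previous iterates and gradients, then averages with neighbors via a doubly stochastic mixing matrix W); the quadratic losses, the per-worker targets `b`, and the fully connected mixing matrix are all hypothetical choices for illustration, not the paper's experimental setup.

```python
import numpy as np

def d2_step(X_prev, X_curr, G_prev, G_curr, W, gamma):
    """One D^2 update (used after the first plain D-PSGD step):
    extrapolate with the previous iterate and gradient, then mix."""
    half = 2 * X_curr - X_prev - gamma * (G_curr - G_prev)
    return W @ half

# Toy setup: worker i minimizes f_i(x) = 0.5 * (x - b_i)^2, so the
# global objective (1/n) * sum_i f_i is minimized at mean(b).
# The b_i differ sharply across workers, mimicking large
# inter-worker data variance (zeta^2 > 0); gradients are
# deterministic here (sigma = 0) for reproducibility.
n = 4
b = np.array([0.0, 1.0, 2.0, 3.0])   # hypothetical per-worker data
W = np.full((n, n), 1.0 / n)         # doubly stochastic mixing matrix
gamma = 0.1
grad = lambda X: X - b               # gradient of each local quadratic

X_prev = np.zeros(n)
G_prev = grad(X_prev)
X_curr = W @ (X_prev - gamma * G_prev)   # first step: plain D-PSGD
for _ in range(200):
    G_curr = grad(X_curr)
    X_next = d2_step(X_prev, X_curr, G_prev, G_curr, W, gamma)
    X_prev, G_prev, X_curr = X_curr, G_curr, X_next

print(X_curr)  # all workers end near the global optimum mean(b) = 1.5
```

Note how the gradient *difference* G_curr − G_prev, rather than the raw local gradient, enters the update: the worker-specific bias in each local gradient cancels between consecutive steps, which is the mechanism that removes the dependence on the inter-worker variance ζ^2 from the convergence rate.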



