Don't Use Large Mini-Batches, Use Local SGD

08/22/2018
by Tao Lin, et al.

Mini-batch stochastic gradient methods are the current state of the art for large-scale distributed training of neural networks and other machine learning models. However, they fail to adapt to a changing communication-vs-computation trade-off in a system, such as when scaling to a large number of workers or devices. Moreover, the fixed communication-bandwidth requirement for exchanging gradients severely limits scalability to multi-node training, e.g. in datacenters, and even more so for training on decentralized networks such as mobile devices. We argue that variants of local SGD, which perform several update steps on a local model before communicating with other nodes, offer significantly better overall performance and communication efficiency, as well as adaptivity to the underlying system resources. Furthermore, we present a new hierarchical extension of local SGD and demonstrate that it can efficiently adapt to several levels of computation costs in a heterogeneous distributed system.
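
The core idea is simple enough to sketch. Below is a minimal, illustrative simulation of local SGD in plain Python/NumPy (a single process stands in for the workers; the names n_workers, H, lr, and rounds are this sketch's own choices, not the paper's notation): each worker takes H local SGD steps starting from the shared model, and only then do the workers average their parameters, so communication happens once every H updates rather than once per update. With H = 1 this reduces to ordinary synchronous mini-batch SGD.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem, split across simulated workers.
n_workers, H, lr, rounds = 4, 8, 0.05, 50
d = 10
X = rng.normal(size=(4000, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=4000)
shards = np.array_split(np.arange(4000), n_workers)

w = np.zeros(d)                               # shared model, synchronized once per round
for _ in range(rounds):
    local_models = []
    for s in shards:                          # each worker starts from the shared model
        w_local = w.copy()
        for _ in range(H):                    # H local SGD steps before any communication
            idx = rng.choice(s, size=32)      # worker-local mini-batch
            grad = X[idx].T @ (X[idx] @ w_local - y[idx]) / len(idx)
            w_local -= lr * grad
        local_models.append(w_local)
    w = np.mean(local_models, axis=0)         # one parameter-averaging step (all-reduce) per round

print("distance to w_true:", np.linalg.norm(w - w_true))

The hierarchical extension mentioned in the abstract can be read as nesting this scheme: frequent averaging within a group of tightly connected workers (e.g. GPUs in one node) and less frequent averaging across groups, matching the different communication costs at each level.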

Related research

05/24/2018 · Local SGD Converges Fast and Communicates Little
Mini-batch stochastic gradient descent (SGD) is the state of the art in ...

04/25/2019 · Communication trade-offs for synchronized distributed SGD with large step size
Synchronous mini-batch SGD is state-of-the-art for large-scale distribut...

12/30/2020 · Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability
Distributed deep learning is an effective way to reduce the training tim...

03/12/2019 · Communication-efficient distributed SGD with Sketching
Large-scale distributed training of neural networks is often limited by ...

10/15/2021 · Trade-offs of Local SGD at Scale: An Empirical Study
As datasets and models become increasingly large, distributed training h...

10/11/2018 · signSGD with Majority Vote is Communication Efficient And Byzantine Fault Tolerant
Training neural networks on large datasets can be accelerated by distrib...

10/02/2020 · Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing
Big data, including applications with high security requirements, are of...
