DeepAI AI Chat
Log In Sign Up

Communication trade-offs for synchronized distributed SGD with large step size

by   Kumar Kshitij Patel, et al.

Synchronous mini-batch SGD is state-of-the-art for large-scale distributed machine learning. However, in practice, its convergence is bottlenecked by slow communication rounds between worker nodes. A natural solution to reduce communication is to use the `local-SGD' model in which the workers train their model independently and synchronize every once in a while. This algorithm improves the computation-communication trade-off but its convergence is not understood very well. We propose a non-asymptotic error analysis, which enables comparison to one-shot averaging i.e., a single communication round among independent workers, and mini-batch averaging i.e., communicating at every step. We also provide adaptive lower bounds on the communication frequency for large step-sizes ( t^-α , α∈ (1/2 , 1 ) ) and show that Local-SGD reduces communication by a factor of O(√(T)/P^3/2), with T the total number of gradients and P machines.


Local SGD Converges Fast and Communicates Little

Mini-batch stochastic gradient descent (SGD) is the state of the art in ...

Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization

Communication overhead is one of the key challenges that hinders the sca...

Trade-offs of Local SGD at Scale: An Empirical Study

As datasets and models become increasingly large, distributed training h...

Cooperative SGD: A unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms

State-of-the-art distributed machine learning suffers from significant d...

Don't Use Large Mini-Batches, Use Local SGD

Mini-batch stochastic gradient methods are the current state of the art ...

Communication-efficient SGD: From Local SGD to One-Shot Averaging

We consider speeding up stochastic gradient descent (SGD) by parallelizi...

Unbiased Compression Saves Communication in Distributed Optimization: When and How Much?

Communication compression is a common technique in distributed optimizat...