Communication trade-offs for synchronized distributed SGD with large step size

04/25/2019
by Kumar Kshitij Patel, et al.

Synchronous mini-batch SGD is state-of-the-art for large-scale distributed machine learning. In practice, however, its convergence is bottlenecked by slow communication rounds between worker nodes. A natural way to reduce communication is the "local-SGD" model, in which workers train their models independently and synchronize only every once in a while. This algorithm improves the computation-communication trade-off, but its convergence is not well understood. We propose a non-asymptotic error analysis, which enables comparison to one-shot averaging, i.e., a single communication round among independent workers, and to mini-batch averaging, i.e., communicating at every step. We also provide adaptive lower bounds on the communication frequency for large step sizes, t^{-α} with α ∈ (1/2, 1), and show that local-SGD reduces communication by a factor of O(√T / P^{3/2}), where T is the total number of gradients and P the number of machines.
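The abstract gives no pseudocode; as a rough illustration only, below is a minimal Python sketch of the local-SGD scheme it describes: P workers run SGD independently with step size t^{-α} and average their models every H steps. Setting H = 1 corresponds to mini-batch averaging and averaging only once at the end to one-shot averaging. The function name local_sgd, its arguments, and the toy noisy objective are hypothetical choices for this sketch, not taken from the paper.

```python
import numpy as np

def local_sgd(grad_fn, x0, P, T, H, alpha=0.6):
    """Sketch of local-SGD: P workers run SGD independently and
    average their iterates every H steps.

    T is the total number of gradient evaluations, split evenly
    across the P workers; alpha in (1/2, 1) sets the step size t^{-alpha}.
    """
    workers = [np.array(x0, dtype=float) for _ in range(P)]
    steps_per_worker = T // P
    for t in range(1, steps_per_worker + 1):
        step = t ** (-alpha)                              # large step size t^{-alpha}
        for p in range(P):
            workers[p] -= step * grad_fn(workers[p])      # independent local update
        if t % H == 0:                                    # communication round
            avg = sum(workers) / P
            workers = [avg.copy() for _ in range(P)]      # synchronize all workers
    return sum(workers) / P                               # final averaged model


# Toy usage (hypothetical): noisy gradients of f(x) = ||x||^2 / 2
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_hat = local_sgd(noisy_grad, x0=np.ones(5), P=4, T=4000, H=50)
```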
