Compressing gradients by exploiting temporal correlation in momentum-SGD

by   Tharindu B. Adikari, et al.

An increasing bottleneck in decentralized optimization is communication. Bigger models and growing datasets mean that decentralization of computation is important and that the amount of information exchanged is quickly growing. While compression techniques have been introduced to cope with the latter, none has considered leveraging the temporal correlations that exist in consecutive vector updates. An important example is distributed momentum-SGD where temporal correlation is enhanced by the low-pass-filtering effect of applying momentum. In this paper we design and analyze compression methods that exploit temporal correlation in systems both with and without error-feedback. Experiments with the ImageNet dataset demonstrate that our proposed methods offer significant reduction in the rate of communication at only a negligible increase in computation complexity. We further analyze the convergence of SGD when compression is applied with error-feedback. In the literature, convergence guarantees are developed only for compressors that provide error-bounds point-wise, i.e., for each input to the compressor. In contrast, many important codes (e.g. rate-distortion codes) provide error-bounds only in expectation and thus provide a more general guarantee. In this paper we prove the convergence of SGD under an expected error assumption by establishing a bound for the minimum gradient norm.


Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

Communication overhead is a major bottleneck hampering the scalability o...

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

Adam is the important optimization algorithm to guarantee efficiency and...

Global Momentum Compression for Sparse Communication in Distributed SGD

With the rapid growth of data, distributed stochastic gradient descent (...

Linearly Converging Error Compensated SGD

In this paper, we propose a unified analysis of variants of distributed ...

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Scalable training of large models (like BERT and GPT-3) requires careful...

Error Feedback Fixes SignSGD and other Gradient Compression Schemes

Sign-based algorithms (e.g. signSGD) have been proposed as a biased grad...

A Simple Asymmetric Momentum Make SGD Greatest Again

We propose the simplest SGD enhanced method ever, Loss-Controlled Asymme...

Please sign up or login with your details

Forgot password? Click here to reset