Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

by Lin Zhang, et al.
Harbin Institute of Technology
Hong Kong Baptist University
The Hong Kong University of Science and Technology

To accelerate distributed training, many gradient compression methods have been proposed to alleviate the communication bottleneck in synchronous stochastic gradient descent (S-SGD), but their efficacy in real-world applications remains unclear. In this work, we first evaluate the efficiency of three representative compression methods (quantization with Sign-SGD, sparsification with Top-k SGD, and low-rank approximation with Power-SGD) on a 32-GPU cluster. The results show that they cannot always outperform well-optimized S-SGD, and can even perform worse, because they are incompatible with three key system optimization techniques (all-reduce, pipelining, and tensor fusion) used in S-SGD. To address this, we propose a novel gradient compression method, called alternate compressed Power-SGD (ACP-SGD), which alternately compresses and communicates low-rank matrices. ACP-SGD not only significantly reduces the communication volume, but also enjoys the same three system optimizations as S-SGD. Compared with Power-SGD, the optimized ACP-SGD largely reduces the compression and communication overheads while achieving similar model accuracy. In our experiments, ACP-SGD achieves an average of 4.06x and 1.43x speedups over S-SGD and Power-SGD, respectively, and it consistently outperforms other baselines across different setups (from 8 GPUs to 64 GPUs, and from 1Gb/s Ethernet to 100Gb/s InfiniBand).
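To make the low-rank idea concrete, here is a minimal NumPy sketch of a Power-SGD-style rank-r compression step, in the spirit of the methods the abstract compares. This is a single-process illustration only: the function names are invented for this sketch, and in a real distributed run the factors `p` and `q` would each be all-reduced across workers (which is what lets the method reuse S-SGD's all-reduce optimization). It is not the authors' ACP-SGD implementation.

```python
import numpy as np

def orthogonalize(m):
    """Orthonormalize the columns of m via QR decomposition."""
    q, _ = np.linalg.qr(m)
    return q

def lowrank_compress(grad, q_prev):
    """One power-iteration compression step for a gradient matrix.

    grad: (n, m) gradient matrix; q_prev: (m, r) factor reused from the
    previous step (warm start). Returns rank-r factors p (n, r), q (m, r).
    In distributed training, p and q are the tensors that get all-reduced,
    so only (n + m) * r values are communicated instead of n * m.
    """
    p = grad @ q_prev          # project onto the current subspace
    p = orthogonalize(p)       # keep the iteration numerically stable
    q = grad.T @ p             # best rank-r fit given the orthonormal p
    return p, q

def decompress(p, q):
    """Reconstruct the low-rank approximation of the gradient."""
    return p @ q.T

# Toy usage: compress a 256x128 "gradient" to rank 4.
rng = np.random.default_rng(0)
grad = rng.standard_normal((256, 128))
rank = 4
q_warm = rng.standard_normal((128, rank))   # random warm start
p, q = lowrank_compress(grad, q_warm)
approx = decompress(p, q)                    # (256, 128) low-rank estimate
```

One power iteration is cheap but biased; Power-SGD-style methods therefore pair it with error feedback (accumulating `grad - approx` into the next step) to preserve convergence, and reuse `q` across iterations as a warm start.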


Related papers:

- PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
- On the Utility of Gradient Compression in Distributed Training Systems
- Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment
- On Efficient Constructions of Checkpoints
- Sparsified SGD with Memory
- 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
- Error Feedback Fixes SignSGD and other Gradient Compression Schemes
