Bandwidth Reduction using Importance Weighted Pruning on Ring AllReduce

by Zehua Cheng, et al.

Training large deep learning models on large-scale clusters equipped with accelerators has become inevitable. Deep gradient compression can greatly improve bandwidth utilization and speed up training, but it is hard to implement on a ring structure. In this paper, we find that redundant gradients and gradient staleness have a negative effect on training. We observe that across different epochs and steps, the neural network focuses on updating different layers and different parameters. To save communication bandwidth and preserve accuracy on a ring structure, breaking the restriction that arises as the number of nodes increases, we propose a new algorithm for measuring the importance of gradients on a large-scale cluster running ring all-reduce, based on the ratio of each parameter's gradient to its value. Our importance-weighted pruning approach achieves gradient compression ratios of 64x on AlexNet and 58.8x on ResNet-50 trained on ImageNet. Meanwhile, to maintain the sparsity of gradient propagation, each node randomly broadcasts the indices of its important gradients; the remaining nodes then prepare the gradients at those indices and perform the all-reduce update. This speeds up the convergence of the model while preserving training accuracy.
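The abstract provides no code, so the sketch below is a minimal NumPy illustration of its two ideas: scoring gradients by the ratio of gradient magnitude to parameter magnitude and keeping only the top-k entries, and then having one randomly chosen node broadcast its index set so every node reduces over the same sparse support. The function names, the `compression_ratio` and `eps` parameters, and the single-process simulation of nodes are illustrative assumptions, not the authors' implementation; a real system would run the exchange over ring all-reduce primitives (e.g., MPI or NCCL).

```python
import numpy as np

def importance_weighted_prune(grad, param, compression_ratio=64, eps=1e-8):
    """Keep the gradient entries with the largest |grad / param| ratio.

    Returns the flat indices of the kept entries and their values; all
    other gradients are treated as zero for communication purposes.
    (Sketch only: the scoring rule follows the abstract, the rest is assumed.)
    """
    flat_grad = grad.ravel()
    flat_param = param.ravel()
    score = np.abs(flat_grad) / (np.abs(flat_param) + eps)  # importance score
    k = max(1, flat_grad.size // compression_ratio)         # e.g. keep 1/64
    idx = np.argpartition(score, -k)[-k:]                   # top-k indices
    return idx, flat_grad[idx]

def sparse_allreduce_round(node_grads, node_params, compression_ratio=64):
    """Simulate one round of the index-broadcast scheme over in-memory 'nodes'.

    A randomly chosen node broadcasts its important-gradient indices, and
    every node then averages only the gradient values at those indices.
    """
    owner = np.random.randint(len(node_grads))  # node whose indices are broadcast
    idx, _ = importance_weighted_prune(
        node_grads[owner], node_params[owner], compression_ratio)
    contributions = np.stack([g.ravel()[idx] for g in node_grads])
    return idx, contributions.mean(axis=0)      # averaged sparse gradient

# Toy usage with 4 simulated nodes and a 256x256 parameter tensor.
rng = np.random.default_rng(0)
params = [rng.normal(size=(256, 256)) for _ in range(4)]
grads = [rng.normal(size=(256, 256)) for _ in range(4)]
idx, avg = sparse_allreduce_round(grads, params, compression_ratio=64)
print(idx.shape, avg.shape)  # (1024,) (1024,): 65536 / 64 entries exchanged
```

Broadcasting a single node's index set keeps all nodes reducing over an identical sparse support, which is what makes the reduction step compatible with a ring all-reduce topology.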

