L-GreCo: An Efficient and General Framework for Layerwise-Adaptive Gradient Compression

Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks due to gradient transmission. To address this issue, entire families of lossy gradient compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, although layers are heterogeneous in terms of parameter count and their impact on model accuracy. In this work, we provide a general framework for adapting the degree of compression across the model's layers dynamically during training, significantly improving the overall compression without sacrificing accuracy. Our framework, called L-GreCo, is based on an efficient adaptive algorithm, which automatically picks the optimal compression parameters for model layers guaranteeing the best compression ratio while respecting a theoretically-justified error constraint. Our extensive experimental study over image classification and language modeling tasks shows that L-GreCo is effective across all three compression families, and achieves up to 2.5× training speedup and up to 5× compression improvement over efficient implementations of standard approaches while recovering full accuracy. Moreover, we show that L-GreCo is complementary to existing adaptive algorithms improving their compression ratio by 50


page 8

page 9


Accordion: Adaptive Gradient Communication via Critical Learning Regime Identification

Distributed model training suffers from communication bottlenecks due to...

GraVAC: Adaptive Compression for Communication-Efficient Distributed DL Training

Distributed data-parallel (DDP) training improves overall application th...

AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training

Highly distributed training of Deep Neural Networks (DNNs) on future com...

RedSync : Reducing Synchronization Traffic for Distributed Deep Learning

Data parallelism has already become a dominant method to scale Deep Neur...

DNN gradient lossless compression: Can GenNorm be the answer?

In this paper, the problem of optimal gradient lossless compression in D...

Model compression as constrained optimization, with application to neural nets. Part I: general framework

Compressing neural nets is an active research problem, given the large s...

Toward Compact Parameter Representations for Architecture-Agnostic Neural Network Compression

This paper investigates deep neural network (DNN) compression from the p...

Please sign up or login with your details

Forgot password? Click here to reset