Global-QSGD: Practical Floatless Quantization for Distributed Learning with Theoretical Guarantees

by Jihao Xin, et al.

Efficient distributed training is a principal driver of recent advances in deep learning. However, communication is often costly and becomes the primary bottleneck in these systems. As a result, there is demand for efficient communication mechanisms that empirically boost throughput while providing theoretical guarantees. In this work, we introduce Global-QSGD, a novel family of quantization operators based on global scaling, engineered to accelerate distributed training. We demonstrate that Global-QSGD is the first theoretically rigorous Allreduce-compatible compression mechanism that achieves a provable speed-up by striking a balance between compression error and communication savings. Importantly, because it is inherently unbiased, Global-QSGD does not rely on costly error feedback, and it offers up to O(√n) additional compression ratio compared to the popular QSGD quantization, where n is the number of workers. To obtain theoretical guarantees, we generalize the notion of standard unbiased compression operators to incorporate Global-QSGD. We show that this wider class permits standard analysis for unbiased compressors and thus ensures convergence for popular optimization algorithms (e.g., distributed SGD) under typical settings. For the empirical component of our work, we carry out a performance modeling analysis to determine whether Global-QSGD can enhance training throughput under specific hardware configurations. We also conduct extensive empirical evaluations on various tasks, testing our theory on both NVLink and PCIe connections as well as a large-scale cloud system.
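To make the idea of unbiased quantization with a shared global scale concrete, here is a minimal NumPy sketch. It is not the operator defined in the paper: the function names, the choice of the largest per-worker L2 norm as the global scale, and the uniform level grid are all assumptions made for illustration. It only shows why a common scale lets workers sum integer codes directly, which is what makes such a scheme fit an Allreduce.

```python
import numpy as np

def quantize_global(x, global_norm, levels, rng):
    """Stochastically round |x| / global_norm onto a uniform grid of `levels` steps.

    Rounding up with probability equal to the fractional part keeps the
    quantizer unbiased: E[q] * global_norm / levels == x.
    (Hypothetical helper, not the paper's exact operator.)
    """
    scaled = np.abs(x) / global_norm * levels        # values in [0, levels]
    lower = np.floor(scaled)
    prob_up = scaled - lower                         # fractional part
    rounded = lower + (rng.random(x.shape) < prob_up)
    return (np.sign(x) * rounded).astype(np.int32)   # signed integer codes

def simulate_allreduce(grads, levels=16, seed=0):
    """Simulate averaging gradients from several workers using integer codes only."""
    rng = np.random.default_rng(seed)
    n_workers = len(grads)
    # Assumption: the shared scale is the largest per-worker L2 norm
    # (this line stands in for a max-Allreduce over one scalar).
    global_norm = max(np.linalg.norm(g) for g in grads)
    if global_norm == 0.0:
        return np.zeros_like(grads[0], dtype=np.float64)
    # Every worker uses the same scale, so the integer codes can be summed
    # as they are (this line stands in for a sum-Allreduce over integers).
    q_sum = sum(quantize_global(g, global_norm, levels, rng) for g in grads)
    # A single dequantization after the reduction recovers the average.
    return q_sum.astype(np.float64) * global_norm / (levels * n_workers)

if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    grads = [data_rng.normal(size=8) for _ in range(4)]
    print("exact mean    :", np.mean(grads, axis=0))
    print("quantized mean:", simulate_allreduce(grads))
```

With per-worker scales, as in plain QSGD, the codes from different workers would refer to different grids and could not be added without first converting back to floats. In this sketch each coordinate of the averaged result differs from the exact mean by at most `global_norm / levels`.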



NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

As the size and complexity of models and datasets grow, so does the need...

Smoothness-Aware Quantization Techniques

Distributed machine learning has become an indispensable tool for traini...

NUQSGD: Improved Communication Efficiency for Data-parallel SGD via Nonuniform Quantization

As the size and complexity of models and datasets grow, so does the need...

Natural Compression for Distributed Deep Learning

Due to their hunger for big data, modern deep learning models are traine...

Quantized Distributed Training of Large Models with Convergence Guarantees

Communication-reduction techniques are a popular way to improve scalabil...

1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Scalable training of large models (like BERT and GPT-3) requires careful...

A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning

Modern large-scale machine learning applications require stochastic opti...
