Quantized Distributed Training of Large Models with Convergence Guarantees

by Ilia Markov, et al.
Max Planck Society
Institute of Science and Technology Austria

Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as the vast majority of the communication involves the model's weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees, is simple to implement and has essentially no overheads. To derive QSDP we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights, and thus the domain over which we train consists of quantized points and is, therefore, highly non-convex. We validate this approach by training GPT-family models with up to 1.3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2.2x.
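The idea of running SGD while only ever storing quantized weights can be illustrated with a toy sketch. The code below is not taken from the paper and is not QSDP itself: it assumes a uniform grid with spacing `step`, uses unbiased stochastic rounding as the quantizer, and minimizes a simple quadratic. All names (`stochastic_quantize`, `quantized_sgd`, `step`, `target`) are hypothetical and chosen for illustration only.

```python
import math
import random


def stochastic_quantize(x, step, rng):
    """Round x to a multiple of `step`, rounding up with probability equal to
    the fractional position of x between its two neighbouring grid points,
    so the result equals x in expectation (assumed quantizer)."""
    scaled = x / step
    low = math.floor(scaled)
    prob_up = scaled - low
    return step * (low + (1 if rng.random() < prob_up else 0))


def quantized_sgd(grad, w0, lr, step, iters, seed=0):
    """Gradient descent in which every stored iterate lies on the grid."""
    rng = random.Random(seed)
    w = [stochastic_quantize(wi, step, rng) for wi in w0]
    for _ in range(iters):
        g = grad(w)  # gradient evaluated at the quantized point
        # Take the usual step, then immediately re-quantize the weights.
        w = [stochastic_quantize(wi - lr * gi, step, rng)
             for wi, gi in zip(w, g)]
    return w


# Toy objective f(w) = 0.5 * ||w - target||^2, whose gradient is w - target.
target = [1.3, -0.7]
step = 0.125
w_final = quantized_sgd(lambda w: [wi - ti for wi, ti in zip(w, target)],
                        w0=[0.0, 0.0], lr=0.1, step=step, iters=2000)
print(w_final)
```

In this toy setting the iterates never leave the grid, yet they settle on grid points adjacent to the unquantized minimizer. The sketch only conveys the flavor of the setting; the paper's guarantees concern its own modification of SGD.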


