Does compressing activations help model parallel training?

by Song Bian, et al.
University of Wisconsin-Madison

Large-scale Transformer models are known for their exceptional performance on a range of tasks, but training them can be difficult because they require communication-intensive model parallelism. One way to improve training speed is to compress the messages exchanged during communication. Previous approaches have primarily focused on compressing gradients in a data-parallel setting; compression in a model-parallel setting remains understudied. We find that model parallelism has fundamentally different characteristics from data parallelism. In this work, we present the first empirical study of the effectiveness of compression methods for model parallelism. We implement and evaluate three common classes of compression algorithms (pruning-based, learning-based, and quantization-based) using a popular Transformer training framework. We evaluate these methods across more than 160 settings and 8 popular datasets, taking into account different hyperparameters, hardware, and both the fine-tuning and pre-training stages. We also analyze how the results change when the model is scaled up. Finally, we provide insights for the future development of compression algorithms for model parallelism.
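To make two of the compression classes named in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`topk_sparsify`, `topk_restore`, `quantize_uint8`, `dequantize_uint8`) are hypothetical, activations are modeled as a flat list of floats, top-k magnitude selection stands in for the pruning-based class, and uniform 8-bit quantization stands in for the quantization-based class. The learning-based class (for example, a trained encoder/decoder pair) is not sketched here.

```python
from typing import List, Tuple


def topk_sparsify(activations: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Pruning-style compression: keep only the k largest-magnitude entries.

    Returns the kept indices (ascending) and their values; only these
    would need to be sent to the next model-parallel worker.
    """
    if not 0 < k <= len(activations):
        raise ValueError("k must be between 1 and len(activations)")
    by_magnitude = sorted(
        range(len(activations)), key=lambda i: abs(activations[i]), reverse=True
    )
    kept = sorted(by_magnitude[:k])
    return kept, [activations[i] for i in kept]


def topk_restore(indices: List[int], values: List[float], size: int) -> List[float]:
    """Receiver side: rebuild a dense vector, filling dropped entries with zero."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


def quantize_uint8(activations: List[float]) -> Tuple[List[int], float, float]:
    """Quantization-style compression: map floats to integer codes in 0..255.

    Returns (codes, offset, scale) so the receiver can approximately invert it.
    """
    lo, hi = min(activations), max(activations)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 when all values are equal
    codes = [round((x - lo) / scale) for x in activations]
    return codes, lo, scale


def dequantize_uint8(codes: List[int], lo: float, scale: float) -> List[float]:
    """Receiver side: map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    x = [0.1, -2.0, 0.03, 1.5]  # toy activation vector (assumed data)
    idx, vals = topk_sparsify(x, k=2)
    print(topk_restore(idx, vals, len(x)))
    codes, lo, scale = quantize_uint8(x)
    print(dequantize_uint8(codes, lo, scale))
```

In both cases the sender transmits a smaller message (k index/value pairs, or one byte per element plus an offset and a scale) and the receiver reconstructs an approximation. The reconstruction error is the quantity whose effect on model quality an empirical study of this kind would need to measure.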


