SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

by   Max Ryabinin, et al.

Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.


page 4

page 6


Does compressing activations help model parallel training?

Large-scale Transformer models are known for their exceptional performan...

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in unsupervised language modeling demonstrates that training...

Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression

In training of modern large natural language processing (NLP) models, it...

Decentralized Training of Foundation Models in Heterogeneous Environments

Training foundation models, such as GPT-3 and PaLM, can be extremely exp...

Towards a Better Theoretical Understanding of Independent Subnetwork Training

Modern advancements in large-scale machine learning would be impossible ...

SCNN: Swarm Characteristic Neural Network

Deep learning is a powerful approach with good performance on many diffe...

PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

The size of Transformer models is growing at an unprecedented pace. It h...

Please sign up or login with your details

Forgot password? Click here to reset