On Efficient Constructions of Checkpoints

by   Yu Chen, et al.

Efficient construction of checkpoints/snapshots is a critical tool for training and diagnosing deep learning models. In this paper, we propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint). LC-Checkpoint simultaneously maximizes the compression rate and optimizes the recovery speed, under the assumption that SGD is used to train the model. LC-Checkpointuses quantization and priority promotion to store the most crucial information for SGD to recover, and then uses a Huffman coding to leverage the non-uniform distribution of the gradient scales. Our extensive experiments show that LC-Checkpoint achieves a compression rate up to 28× and recovery speedup up to 5.77× over a state-of-the-art algorithm (SCAR).


page 4

page 6

page 7


Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees

Communication compression is a crucial technique for modern distributed ...

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning

To accelerate distributed training, many gradient compression methods ha...

Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations

Communication bottleneck has been identified as a significant issue in d...

FastSGD: A Fast Compressed SGD Framework for Distributed Machine Learning

With the rapid increase of big data, distributed Machine Learning (ML) h...

Decentralized Deep Learning with Arbitrary Communication Compression

Decentralized training of deep learning models is a key element for enab...

DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization

With the increase in the scale of Deep Learning (DL) training workloads ...

On the Utility of Gradient Compression in Distributed Training Systems

Rapid growth in data sets and the scale of neural network architectures ...

Please sign up or login with your details

Forgot password? Click here to reset