Domain-specific Communication Optimization for Distributed DNN Training

by Hao Wang, et al.

Communication overhead poses an important obstacle to distributed DNN training and has drawn increasing attention in recent years. Despite continuous efforts, prior solutions such as gradient compression/reduction, compute/communication overlapping, and layer-wise flow scheduling are still coarse-grained and insufficient for efficient distributed training, especially when the network is under pressure. We present DLCP, a novel solution that exploits the domain-specific properties of deep learning to optimize the communication overhead of DNN training in a fine-grained manner. At its heart, DLCP comprises several key innovations beyond prior work. First, it exploits the bounded loss tolerance of SGD-based training to improve tail communication latency, which cannot be avoided purely through gradient compression. Second, it performs fine-grained packet-level prioritization and dropping, as opposed to flow-level scheduling, based on the layers and magnitudes of gradients to further speed up model convergence without affecting accuracy. Third, it leverages inter-packet order independence to perform per-packet load balancing without causing classical re-ordering issues. DLCP works with both Parameter Server and collective communication routines. We have implemented DLCP with commodity switches, integrated it with various training frameworks including TensorFlow, MXNet and PyTorch, and deployed it in our small-scale testbed with 10 Nvidia V100 GPUs. Our testbed experiments and large-scale simulations show that DLCP delivers up to 84.3% additional training acceleration over the best existing solutions.
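The packet-level prioritization and dropping idea can be illustrated with a small sketch. This is not DLCP's implementation: the abstract does not specify how layers and magnitudes map to priorities, so the mapping below (earlier layers ranked higher, small-magnitude packets demoted by one level) is an assumption, and the names `Packet`, `assign_priority`, `enqueue_with_drop`, `num_levels` and `mag_threshold` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    layer: int          # index of the layer whose gradients this packet carries
    grads: List[float]  # gradient values in the packet
    priority: int = 0   # 0 = highest priority; larger values are dropped first


def assign_priority(pkt: Packet, num_layers: int,
                    num_levels: int = 4, mag_threshold: float = 0.01) -> int:
    """Map a packet to one of num_levels priority levels.

    Assumption (not stated in the abstract): earlier layers get higher
    priority, and packets with a small mean gradient magnitude are
    demoted by one level.
    """
    level = pkt.layer * num_levels // num_layers
    mean_mag = sum(abs(g) for g in pkt.grads) / len(pkt.grads) if pkt.grads else 0.0
    if mean_mag < mag_threshold:
        level += 1
    return min(level, num_levels - 1)


def enqueue_with_drop(queue: List[Packet], pkt: Packet,
                      capacity: int) -> Optional[Packet]:
    """Append pkt to a bounded queue; on overflow, drop the lowest-priority packet.

    Returns the dropped packet, or None if nothing was dropped.
    """
    queue.append(pkt)
    if len(queue) <= capacity:
        return None
    victim = max(queue, key=lambda p: p.priority)  # largest value = least important
    queue.remove(victim)
    return victim


if __name__ == "__main__":
    q: List[Packet] = []
    for layer, grads in [(0, [0.5, -0.5]), (7, [0.001]), (2, [0.001])]:
        p = Packet(layer, grads)
        p.priority = assign_priority(p, num_layers=8)
        dropped = enqueue_with_drop(q, p, capacity=2)
        print(p.layer, p.priority, "dropped:", dropped.layer if dropped else None)
```

Dropping by priority rather than by arrival order is what distinguishes this from a plain tail-drop queue: when the buffer overflows, the packet judged least important to convergence is the one sacrificed, which relies on the bounded loss tolerance of SGD mentioned above.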




dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training

Distributed training using multiple devices (e.g., GPUs) has been widely...

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

It is important to scale out deep neural network (DNN) training for redu...

Homomorphic Parameter Compression for Distributed Deep Learning Training

Distributed training of deep neural networks has received significant re...

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

To reduce the long training time of large deep neural network (DNN) mode...

How to Attain Communication-Efficient DNN Training? Convert, Compress, Correct

In this paper, we introduce 𝖢𝖮_3, an algorithm for communication-efficie...

PRONTO: Preamble Overhead Reduction with Neural Networks for Coarse Synchronization

In IEEE 802.11 WiFi-based waveforms, the receiver performs coarse time a...

Flow-Level Packet Loss Detection via Sketch Decomposition and Matrix Optimization

For cloud service providers, fine-grained packet loss detection across d...
