Communication Optimization Strategies for Distributed Deep Learning: A Survey

03/06/2020
by   Shuo Ouyang, et al.

Recent trends in high-performance computing and deep learning have led to a proliferation of studies on large-scale deep neural network (DNN) training. However, frequent communication among computation nodes drastically slows down overall training, making communication the bottleneck in distributed training, particularly in clusters with limited network bandwidth. To mitigate this communication overhead, researchers have proposed various optimization strategies. In this paper, we give a comprehensive survey of communication strategies from both the algorithm and the computer network perspectives. Algorithm optimizations focus on reducing the amount of communication in distributed training, while network optimizations focus on speeding up the communication between distributed devices. At the algorithm level, we describe how to reduce the number of communication rounds and the number of bits transmitted per round, and we also shed light on how to overlap computation and communication. At the network level, we discuss the effects of the network infrastructure, including communication schemes, network protocols, and topology. Finally, we extrapolate potential challenges and research directions for communication acceleration in distributed DNN training.
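As a concrete illustration of the "fewer transmitted bits per round" category mentioned in the abstract, the sketch below shows top-k gradient sparsification, where each worker sends only the largest-magnitude gradient entries (values plus indices) instead of the dense gradient. This is a minimal, self-contained illustration, not code from the surveyed systems; the function names and the 1% compression ratio are illustrative assumptions.

```python
# Minimal sketch of top-k gradient sparsification, one way to reduce the
# number of bits transmitted per communication round. Names and the 1%
# ratio are illustrative, not taken from the survey.
import numpy as np

def topk_sparsify(grad: np.ndarray, ratio: float = 0.01):
    """Keep only the top `ratio` fraction of gradient entries by magnitude."""
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest |g_i|
    return idx, flat[idx]                          # the sparse payload a worker would send

def densify(idx: np.ndarray, vals: np.ndarray, shape) -> np.ndarray:
    """Reconstruct a dense gradient from the sparse payload on the receiving side."""
    flat = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)

# Example: a 1M-entry gradient shrinks to ~1% of its entries before transmission.
g = np.random.randn(1000, 1000).astype(np.float32)
idx, vals = topk_sparsify(g, ratio=0.01)
g_hat = densify(idx, vals, g.shape)
print(f"sent {vals.size} of {g.size} entries")
```

In practice such schemes are usually paired with error feedback (accumulating the dropped residual locally) so that the sparsification error does not bias convergence; the survey's algorithm-level discussion covers these compression variants alongside methods that reduce the number of communication rounds.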


Related Research

03/10/2020  Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
Distributed deep learning becomes very common to reduce the overall trai...

05/05/2022  dPRO: A Generic Profiling and Optimization System for Expediting Distributed DNN Training
Distributed training using multiple devices (e.g., GPUs) has been widely...

02/01/2022  TopoOpt: Optimizing the Network Topology for Distributed DNN Training
We explore a novel approach for building DNN training clusters using com...

10/02/2019  Accelerating Data Loading in Deep Neural Network Training
Data loading can dominate deep neural network training time on large-sca...

04/29/2020  Caramel: Accelerating Decentralized Distributed Deep Learning with Computation Scheduling
The method of choice for parameter aggregation in Deep Neural Network (D...

09/26/2022  Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion
This paper proposes DisCo, an automatic deep learning compilation module...

03/06/2020  Trends and Advancements in Deep Neural Network Communication
Due to their great performance and scalability properties neural network...
