OpTree: An Efficient Algorithm for All-gather Operation in Optical Interconnect Systems

11/28/2022
by Fei Dai, et al.

All-gather is one of the most important collective communication primitives in parallel and distributed computation, and it plays an essential role in many HPC applications such as distributed Deep Learning (DL) with model and hybrid parallelism. To relieve the communication bottleneck of All-gather, optical interconnection networks can provide unprecedentedly high bandwidth and reliability for data transfer among distributed nodes. However, most traditional All-gather algorithms are designed for electrical interconnects and are poorly suited to optical interconnect systems, resulting in poor performance. This paper proposes an efficient scheme, called OpTree, for the All-gather operation on optical interconnect systems. OpTree derives an optimal m-ary tree corresponding to the optimal number of communication stages, achieving minimum communication time. We further analyze and compare the communication steps of OpTree with those of existing All-gather algorithms. Theoretical results show that OpTree requires far fewer communication steps than existing All-gather algorithms on optical interconnect systems. Simulation results show that OpTree can reduce communication time by 72.21 three existing All-gather schemes, WRHT, Ring, and NE.
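To make the baseline concrete, the following is a minimal sketch of what All-gather computes and why a Ring schedule needs N - 1 communication steps for N nodes. It simulates the standard textbook Ring All-gather in plain Python; it is not OpTree, it does not model optical wavelengths or link bandwidth, and the function name `ring_all_gather` is a hypothetical one chosen for illustration.

```python
def ring_all_gather(blocks):
    """Simulate a textbook Ring All-gather.

    Node i starts with blocks[i]. Returns (buffers, steps), where
    buffers[i] is the full list of blocks held by node i at the end.
    This is an illustrative baseline only, not the OpTree scheme.
    """
    n = len(blocks)
    # buffers[i][j] holds block j once node i has received it
    buffers = [[None] * n for _ in range(n)]
    for i in range(n):
        buffers[i][i] = blocks[i]

    steps = 0
    for s in range(n - 1):
        # In step s, node i forwards the block that originated at
        # node (i - s) mod n to its right-hand neighbour (i + 1) mod n.
        sends = [((i + 1) % n, (i - s) % n, buffers[i][(i - s) % n])
                 for i in range(n)]
        # Apply all sends together: every node communicates in parallel.
        for dst, idx, data in sends:
            buffers[dst][idx] = data
        steps += 1
    return buffers, steps


buffers, steps = ring_all_gather(["a", "b", "c", "d"])
```

With four nodes this sketch finishes after three steps, and every node ends up holding all four blocks. The step count grows linearly with the number of nodes, which is the kind of cost the abstract's step-count comparison is concerned with.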

