NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration

by Shuo Liu, et al.

We present NetReduce, a novel RDMA-compatible in-network reduction architecture that accelerates distributed DNN training. Unlike existing designs, NetReduce maintains reliable connections between end-hosts over Ethernet and does not terminate the connection in the network. This allows us to fully reuse the congestion-control and reliability designs of RoCE; at the same time, we do not need to implement a costly network protocol processing stack in the switch, as InfiniBand does. Our FPGA-based prototype is an out-of-the-box solution that requires no modification to commodity devices such as NICs or switches. To coordinate the end-host and the switch, NetReduce customizes the transport protocol only on the first packet of each data message, remaining compliant with RoCE v2. A dedicated status monitoring module reuses the reliability mechanism of RoCE v2 to handle packet loss, and a message-level credit-based flow control algorithm fully utilizes bandwidth while avoiding buffer overflow. We study the effect of intra-machine bandwidth on training performance in multi-machine multi-GPU scenarios and give sufficient conditions under which hierarchical NetReduce outperforms other algorithms. We also extend the design from rack-level aggregation to the more general spine-leaf topology of the data center. NetReduce accelerates training by up to 1.7x for CNN-based CV tasks and 1.5x for transformer-based NLP tasks. Simulations on large-scale systems indicate that NetReduce scales better than the state-of-the-art ring all-reduce.
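The message-level credit-based flow control mentioned above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it only models the core idea that the switch grants one credit per free aggregation-buffer slot, a host may send a message only while holding a credit, and freed slots are handed to queued messages first.

```python
from collections import deque

class CreditFlowControl:
    """Sketch of message-level credit-based flow control (assumed
    simplification): credits track free aggregation-buffer slots
    in the switch; senders without a credit must queue."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots  # free switch buffer slots
        self.pending = deque()       # messages waiting for a credit
        self.sent = []               # messages admitted to the switch

    def send(self, msg):
        # Admit the message if a credit is available, else queue it,
        # so the switch buffer can never overflow.
        if self.credits > 0:
            self.credits -= 1
            self.sent.append(msg)
        else:
            self.pending.append(msg)

    def on_slot_freed(self):
        # The switch finished reducing a message; its slot is handed
        # directly to a queued message, or returned as a free credit.
        if self.pending:
            self.sent.append(self.pending.popleft())
        else:
            self.credits += 1

# Usage: a 2-slot buffer forces the third message to wait.
fc = CreditFlowControl(buffer_slots=2)
for m in ["m0", "m1", "m2"]:
    fc.send(m)        # m2 queues: no credit left
fc.on_slot_freed()    # freed slot admits m2
```

The design choice being illustrated is that back-pressure is enforced per message rather than per packet, which keeps the sender pipeline full whenever the switch has spare buffer capacity.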

