Optimized Network Architectures for Large Language Model Training with Billions of Parameters

07/22/2023
by Weiyang Wang, et al.

This paper challenges the well-established paradigm of building any-to-any networks for training Large Language Models (LLMs). We show that LLMs exhibit a unique communication pattern: only small groups of GPUs require high-bandwidth any-to-any communication within them to achieve near-optimal training performance. Across these groups, the communication is insignificant, sparse, and homogeneous. We propose a new network architecture that closely matches the communication requirements of LLMs. Our architecture partitions the cluster into sets of GPUs interconnected by non-blocking, any-to-any, high-bandwidth interconnects that we call HB domains. Across the HB domains, the network connects only those GPUs that have communication demands. We call this a "rail-only" connection, and show that our proposed architecture reduces the network cost by up to 75% compared to state-of-the-art any-to-any Clos networks without compromising the performance of LLM training.
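To make the rail-only idea concrete, here is a minimal connectivity sketch (not from the paper; the domain and GPU counts are hypothetical). It counts how many cross-domain GPU pairs a full any-to-any fabric must support versus a rail-only design that links GPUs across HB domains only when they share the same intra-domain rank (rail). The exact savings depend on cluster dimensions and the cost model; the paper's 75% figure comes from its own analysis, not from this pair count.

```python
# Sketch: cross-domain connectivity of any-to-any vs. rail-only topologies.
# Assumptions (hypothetical): "num_domains" HB domains, each with
# "gpus_per_domain" GPUs; a "rail" is the set of GPUs sharing the same
# rank across all HB domains.

def any_to_any_cross_pairs(num_domains: int, gpus_per_domain: int) -> int:
    """Cross-domain GPU pairs a full any-to-any (Clos-like) fabric must serve."""
    total = num_domains * gpus_per_domain
    all_pairs = total * (total - 1) // 2
    intra_pairs = num_domains * (gpus_per_domain * (gpus_per_domain - 1) // 2)
    return all_pairs - intra_pairs

def rail_only_cross_pairs(num_domains: int, gpus_per_domain: int) -> int:
    """Cross-domain pairs in a rail-only design: GPUs are connected across
    HB domains only if they occupy the same rank (rail) in their domain."""
    pairs_per_rail = num_domains * (num_domains - 1) // 2
    return gpus_per_domain * pairs_per_rail

if __name__ == "__main__":
    n, k = 16, 8  # hypothetical: 16 HB domains of 8 GPUs each
    full = any_to_any_cross_pairs(n, k)
    rail = rail_only_cross_pairs(n, k)
    print(f"any-to-any cross-domain pairs: {full}")
    print(f"rail-only cross-domain pairs:  {rail}")
    print(f"reduction in cross-domain pairs: {1 - rail / full:.1%}")
```

With these hypothetical sizes the rail-only pattern needs far fewer cross-domain connections, which is the intuition behind removing the any-to-any cross-domain fabric; the actual dollar savings reported in the paper follow from its cost model of switches and transceivers.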

