Computation vs. Communication Scaling for Future Transformers on Future Hardware

02/06/2023
by Suchita Pati, et al.

Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has also increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing scenarios, it is important to understand how compute and communication will scale relative to one another as models grow and hardware evolves. A careful study that answers this question can better guide the design of future systems that can efficiently train future large models. This work therefore provides a comprehensive multi-axial (algorithmic, empirical, hardware evolution) analysis of compute vs. communication (Comp-vs.-Comm) scaling for future Transformer models on future hardware. First, our algorithmic analysis shows that compute generally enjoys an edge over communication as models scale. However, because memory capacity scales more slowly than compute, these trends are being stressed. Next, we quantify this edge by empirically studying how Comp-vs.-Comm scales for future models on future hardware. To avoid profiling numerous Transformer models across many setups, we extract execution regions and project their costs using operator models, which allows a spectrum (hundreds) of future model/hardware scenarios to be studied accurately (<15% error). Our experiments show that communication will become a significant portion (40-75%) of runtime as models and hardware evolve. Moreover, communication that is hidden by overlapped computation in today's models often cannot be hidden in future, larger models. Overall, this work highlights the increasingly large role communication will play as models scale and discusses techniques and upcoming technologies that can help address it.
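To illustrate the kind of algorithmic Comp-vs.-Comm reasoning described above, the sketch below estimates per-layer compute and all-reduce times for a tensor-parallel Transformer layer as the model dimension grows. It is a minimal first-order model, not the paper's operator models; the hardware figures (peak FLOP/s, all-reduce bandwidth) and the simplified FLOP/byte counts are illustrative assumptions. Because GEMM FLOPs grow roughly with d_model^2 per token while the all-reduced activation volume grows only with d_model, the compute-to-communication ratio improves as models scale, consistent with the abstract's algorithmic observation.

# A minimal first-order sketch (not the paper's operator models) of compute vs.
# communication time for one tensor-parallel Transformer layer. All hardware
# numbers below are illustrative assumptions, not measurements from the paper.

def layer_times(d_model, seq_len, batch, n_gpus,
                flops_per_s=312e12,      # assumed per-GPU peak throughput
                link_bw_bytes=150e9,     # assumed per-GPU all-reduce bandwidth (B/s)
                bytes_per_elem=2):       # 16-bit activations
    """Rough forward-pass compute and all-reduce times per layer under
    Megatron-style tensor parallelism (attention and MLP GEMMs only)."""
    tokens = seq_len * batch
    # GEMM FLOPs scale ~d_model^2 per token (QKV/output projections plus the
    # 4x-wide MLP), split across the tensor-parallel group.
    flops = 2 * tokens * (4 * d_model * d_model + 8 * d_model * d_model) / n_gpus
    t_compute = flops / flops_per_s
    # Two all-reduces of the layer activations (after attention and after the
    # MLP); a ring all-reduce moves roughly 2x the tensor size per GPU.
    ar_bytes = 2 * (2 * tokens * d_model * bytes_per_elem)
    t_comm = ar_bytes / link_bw_bytes
    return t_compute, t_comm

for d in (4096, 12288, 32768):           # GPT-3-class to hypothetical future sizes
    tc, tm = layer_times(d_model=d, seq_len=2048, batch=1, n_gpus=8)
    print(f"d_model={d}: compute {tc*1e3:.2f} ms, comm {tm*1e3:.2f} ms, "
          f"ratio {tc/tm:.1f}x")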
