Scaling Laws for Sparsely-Connected Foundation Models

09/15/2023
by Elias Frantar, et al.

We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales, on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity", the sparsity level that yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we find that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity towards computational efficiency improvements.
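To make the idea of a joint sparsity/size/data scaling law concrete, below is a minimal, hypothetical sketch of how such a law could be fit to experimental points. The functional form, the coefficient names (a_S, b_S, c_S, b_N, a_D, b_D, e), and all numbers are illustrative assumptions for this sketch, not the exact law or values reported in the paper.

```python
# Hypothetical sketch: fitting a generic joint scaling law of the form
#   L(S, N, D) = (a_S * (1 - S)**b_S + c_S) * N**(-b_N) + a_D * D**(-b_D) + e
# to observed (sparsity, non-zero parameters, training tokens, loss) points.
# The functional form and all coefficients here are assumptions, not the paper's fit.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, a_S, b_S, c_S, b_N, a_D, b_D, e):
    S, N, D = X  # weight sparsity in [0, 1), non-zero parameters, training tokens
    return (a_S * (1.0 - S) ** b_S + c_S) * N ** (-b_N) + a_D * D ** (-b_D) + e

# Synthetic observations on a small (sparsity, size, data) grid; all numbers are placeholders.
S_grid, N_grid, D_grid = np.meshgrid([0.0, 0.5, 0.75, 0.875],
                                     [1e7, 1e8, 1e9],
                                     [1e9, 1e10, 1e11])
S, N, D = S_grid.ravel(), N_grid.ravel(), D_grid.ravel()
rng = np.random.default_rng(0)
true_params = (100.0, 0.5, 400.0, 0.34, 410.0, 0.28, 1.7)  # made-up "ground truth"
loss = scaling_law((S, N, D), *true_params) + rng.normal(0.0, 0.01, S.size)

# Recover the coefficients from the noisy measurements.
p0 = [50.0, 0.5, 300.0, 0.3, 300.0, 0.3, 1.5]
popt, _ = curve_fit(scaling_law, (S, N, D), loss, p0=p0, maxfev=100000)
print(dict(zip(["a_S", "b_S", "c_S", "b_N", "a_D", "b_D", "e"], np.round(popt, 3))))
```

Given a fitted law of this kind, an "optimal sparsity" curve could then be read off by minimizing the predicted loss over S at a fixed number of non-zero parameters N and training budget D, which is the quantity the abstract describes.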


