Unified Scaling Laws for Routed Language Models

02/02/2022
by Aidan Clark et al.

The performance of a language model has been shown to be effectively modeled as a power-law in its parameter count. Here we study the scaling behaviors of Routing Networks: architectures that conditionally use only a subset of their parameters while processing an input. For these models, parameter count and computational requirement form two independent axes along which an increase leads to better performance. In this work we derive and justify scaling laws defined on these two variables which generalize those known for standard language models and describe the performance of a wide range of routing architectures trained via three different techniques. Afterwards we provide two applications of these laws: first deriving an Effective Parameter Count along which all models scale at the same rate, and then using the scaling coefficients to give a quantitative comparison of the three routing techniques considered. Our analysis derives from an extensive evaluation of Routing Networks across five orders of magnitude of size, including models with hundreds of experts and hundreds of billions of parameters.
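To make the two-variable scaling idea concrete, below is a minimal, illustrative sketch of fitting one simple law family of the kind the abstract describes: loss modeled jointly in dense parameter count and expert count via a bilinear form in their logarithms. The functional form, coefficient names (a, b, c, d), and the synthetic data are assumptions for illustration, not the paper's exact fit or coefficients.

```python
# Illustrative sketch: fit a two-variable scaling law of the assumed form
#   log L(N, E) = a*log N + b*log E + c*(log N)*(log E) + d
# where N is the dense parameter count and E the number of experts.
import numpy as np

def fit_bilinear_law(N, E, L):
    """Least-squares fit of log-loss against log N, log E, and their product."""
    logN, logE, logL = np.log(N), np.log(E), np.log(L)
    X = np.stack([logN, logE, logN * logE, np.ones_like(logN)], axis=1)
    coeffs, *_ = np.linalg.lstsq(X, logL, rcond=None)
    return coeffs  # a, b, c, d

# Synthetic measurements on a small grid of model sizes and expert counts
# (generated from known coefficients so the fit can be checked by eye).
N = np.array([1e7, 1e8, 1e9, 1e7, 1e8, 1e9])
E = np.array([1.0, 1.0, 1.0, 64.0, 64.0, 64.0])
true = (-0.07, -0.02, 0.0005, 3.0)
L = np.exp(true[0] * np.log(N) + true[1] * np.log(E)
           + true[2] * np.log(N) * np.log(E) + true[3])

a, b, c, d = fit_bilinear_law(N, E, L)
print(f"fitted: a={a:.4f}, b={b:.4f}, c={c:.4f}, d={d:.4f}")
```

Given a fit of this kind, an Effective Parameter Count in the abstract's sense can be read off as the dense model size whose predicted loss matches that of a given routed model, though the exact construction used in the paper may differ from this sketch.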


