Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

07/21/2022
by Yi Tay, et al.

There has been considerable interest in the scaling properties of Transformer models. However, little work has investigated how scaling is affected by different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures, such as Transformers, Switch Transformers, Universal Transformers, dynamic convolutions, Performers, and the recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best-performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.
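
To make the notion of "scaling behaviour" concrete, the sketch below fits a saturating power law L(N) = a * N^(-b) + c, a standard functional form in the scaling-laws literature, to (parameter count, loss) pairs for two architectures. This is a minimal illustration, not the paper's methodology: all data points are synthetic placeholders, and the architecture names are used only as labels. A larger fitted exponent b indicates an architecture that benefits more from scale, which is one way the best-performing model can flip between scales.

    # Illustrative sketch: fit a saturating power law L(N) = a * N^(-b) + c
    # to (parameter count, loss) pairs. All numbers below are synthetic
    # placeholders, not results from the paper.
    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(n_params, a, b, c):
        # Saturating power law commonly used to model loss vs. model size.
        return a * n_params ** (-b) + c

    # Hypothetical (params, upstream loss) measurements for two architectures.
    sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
    loss_transformer = np.array([3.10, 2.85, 2.60, 2.42, 2.28])  # placeholder
    loss_mixer       = np.array([3.00, 2.88, 2.74, 2.65, 2.58])  # placeholder

    for name, losses in [("Transformer", loss_transformer),
                         ("MLP-Mixer", loss_mixer)]:
        (a, b, c), _ = curve_fit(scaling_law, sizes, losses,
                                 p0=(10.0, 0.1, 1.0), maxfev=10000)
        print(f"{name}: L(N) ~ {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
        # A larger exponent b means the architecture improves faster with
        # scale, so rankings between architectures can change as N grows.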


