Scaling Laws for Autoregressive Generative Modeling

by   Tom Henighan, et al.

We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image↔text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information theoretic interpretation as S(True) + D_KL(True||Model), and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an 8× 8 resolution, and we can forecast the model size needed to achieve any given reducible loss (ie D_KL) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.


page 11

page 24

page 25

page 29


Scaling Laws for Acoustic Models

There is a recent trend in machine learning to increase model quality by...

Scaling laws for single-agent reinforcement learning

Recent work has shown that, in generative modeling, cross-entropy loss i...

Scaling Laws for Neural Language Models

We study empirical scaling laws for language model performance on the cr...

A Theory for Emergence of Complex Skills in Language Models

A major driver of AI products today is the fact that new skills emerge i...

Scaling Laws for Sparsely-Connected Foundation Models

We explore the impact of parameter sparsity on the scaling behavior of T...

Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

Scaling laws have been recently employed to derive compute-optimal model...

Scaling Laws Beyond Backpropagation

Alternatives to backpropagation have long been studied to better underst...

Please sign up or login with your details

Forgot password? Click here to reset