Scaling Laws for Generative Mixed-Modal Language Models

by Armen Aghajanyan, et al.

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous unimodal scaling laws. We also report four empirical phenomena observed during training, including emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyperparameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.
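To make the idea of "an additive term to previous unimodal scaling laws" concrete, the following is a minimal sketch rather than the paper's fitted law. It assumes a Chinchilla-style parametric loss for each modality and an illustrative interaction term that depends on model size and total token count. The function names, the functional form of the interaction, and every coefficient value are hypothetical placeholders chosen for illustration only.

```python
def unimodal_loss(n_params, n_tokens, E, A, alpha, B, beta):
    """Chinchilla-style parametric loss: E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def mixed_modal_loss(n_params, tokens_i, tokens_j, law_i, law_j, inter):
    """Average of two unimodal losses plus an additive interaction term.

    ASSUMPTION: the interaction form below is illustrative, not taken from
    the abstract. A larger C models more synergy (lower loss); the
    size- and data-dependent terms model competition between modalities.
    """
    base = 0.5 * (unimodal_loss(n_params, tokens_i, **law_i)
                  + unimodal_loss(n_params, tokens_j, **law_j))
    interaction = (-inter["C"]
                   + inter["A"] / n_params ** inter["alpha"]
                   + inter["B"] / (tokens_i + tokens_j) ** inter["beta"])
    return base + interaction


# Hypothetical placeholder coefficients, NOT fitted values from the paper.
TEXT_LAW = dict(E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28)
SPEECH_LAW = dict(E=2.1, A=300.0, alpha=0.30, B=350.0, beta=0.27)
INTERACTION = dict(C=0.05, A=10.0, alpha=0.30, B=5.0, beta=0.30)

if __name__ == "__main__":
    # Sweep the model-size range mentioned in the abstract (8M to 30B).
    for n in (8e6, 1e9, 30e9):
        loss = mixed_modal_loss(n, 50e9, 50e9, TEXT_LAW, SPEECH_LAW, INTERACTION)
        print(f"N={n:.0e}  predicted mixed-modal loss={loss:.3f}")
```

Setting the interaction coefficients to zero recovers the plain average of the two unimodal predictions, which is the sense in which the interaction is "additive" in this sketch.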




A Solvable Model of Neural Scaling Laws

Large language models with a huge number of parameters, when trained on ...

Training Compute-Optimal Large Language Models

We investigate the optimal model size and number of tokens for training ...

Scaling Laws for Acoustic Models

There is a recent trend in machine learning to increase model quality by...

The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

The Languini Kitchen serves as both a research collective and codebase d...

Unified Scaling Laws for Routed Language Models

The performance of a language model has been shown to be effectively mod...

Pragmatic Constraint on Distributional Semantics

This paper studies the limits of language models' statistical learning i...

A Theory for Emergence of Complex Skills in Language Models

A major driver of AI products today is the fact that new skills emerge i...
