Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale

05/26/2023
by Vijeta Deshpande, et al.

In recent years, language models have grown drastically in size, and their abilities have been shown to improve with scale. The majority of recent scaling-law studies have focused on high-compute, high-parameter-count settings, leaving the question of when these abilities begin to emerge largely unanswered. In this paper, we investigate whether the effects of pre-training can be observed when the problem size is reduced, modeling a smaller, reduced-vocabulary language. We show the benefits of pre-training with a masked language modeling (MLM) objective in models as small as 1.25M parameters, and establish a strong correlation between pre-training perplexity and downstream performance (GLUE benchmark). We examine downscaling effects, extending scaling laws to models as small as 1M parameters. At this scale, we observe a break in the power law for compute-optimal models and show that the MLM loss does not scale smoothly with compute cost (FLOPs) below 2.2 × 10^15 FLOPs. We also find that adding layers does not always benefit downstream performance.
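The scaling-law relationship the abstract refers to is typically a power law of the form L = a · C^b between compute cost C (FLOPs) and loss L; a "break" means the fitted exponent no longer holds below some compute threshold. A minimal sketch of fitting such a power law by linear regression in log-log space (the data here are synthetic, chosen to follow an assumed power law exactly, and are not the paper's measurements):

```python
import numpy as np

# Synthetic (FLOPs, MLM loss) pairs for compute-optimal models.
# These values are illustrative only; they follow L = a * C^b
# with a = 20.0 and b = -0.05 by construction.
flops = np.array([1e14, 1e15, 1e16, 1e17, 1e18])
loss = 20.0 * flops ** -0.05

# A power law is linear in log-log space:
# log L = log a + b * log C, so fit with ordinary least squares.
b, log_a = np.polyfit(np.log(flops), np.log(loss), 1)
a = np.exp(log_a)

print(f"fitted a = {a:.3f}, b = {b:.4f}")
```

A break in the power law would show up as systematic deviation of the low-FLOP points from this fitted line, which is what the paper reports below 2.2 × 10^15 FLOPs.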


