Transformer-Based LM Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens

by Byung-Doh Oh, et al.

Recent psycholinguistic studies have drawn conflicting conclusions about the relationship between the quality of a language model and the ability of its surprisal estimates to predict human reading times. This conflict has been speculated to stem from the large gap in both the amount of training data and model capacity across studies. The current work aims to consolidate these findings by evaluating surprisal estimates from Transformer-based language model variants, which vary systematically in the amount of training data and model capacity, on their ability to predict human reading times. The results show that surprisal estimates from most variants with contemporary model capacities provide the best fit after seeing about two billion training tokens, after which they begin to diverge from humanlike expectations. Additionally, newly trained smaller model variants reveal a 'tipping point' at convergence, after which the decrease in language model perplexity begins to result in poorer fits to human reading times. These results suggest that the massive amount of training data is mainly responsible for the poorer fit achieved by surprisal from larger pre-trained language models, and that a certain degree of model capacity is necessary for Transformer-based language models to capture humanlike expectations.
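
For readers unfamiliar with the two quantities the abstract relates, the following is a minimal sketch of their standard definitions: surprisal as the negative log probability of a word given its context, and perplexity as the exponentiated mean surprisal. The function names and the probability values are hypothetical illustrations, not taken from the paper, and the sketch does not reproduce the paper's models or its regression analysis of reading times.

```python
import math


def surprisal_bits(prob):
    """Surprisal (in bits) of a word whose conditional probability is `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def perplexity(probs):
    """Perplexity as 2 raised to the mean per-word surprisal in bits."""
    surprisals = [surprisal_bits(p) for p in probs]
    return 2 ** (sum(surprisals) / len(surprisals))


# Hypothetical per-word conditional probabilities; in practice these would
# come from a language model's next-word distribution (assumption).
word_probs = [0.5, 0.25, 0.125]
word_surprisals = [surprisal_bits(p) for p in word_probs]
sequence_perplexity = perplexity(word_probs)
```

Under these definitions, a model that assigns higher probability to the observed words has lower surprisal and lower perplexity. The abstract's finding is that, beyond a certain point, lower perplexity no longer corresponds to a better fit to human reading times.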



Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

This work presents a detailed linguistic analysis into why larger Transf...

Probabilistic Predictions of People Perusing: Evaluating Metrics of Language Model Performance for Psycholinguistic Modeling

By positing a relationship between naturalistic reading times and inform...

On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior

Human reading behavior is tuned to the statistics of natural language: t...

Entropy- and Distance-Based Predictors From GPT-2 Attention Patterns Predict Reading Times Over and Above GPT-2 Surprisal

Transformer-based large language models are trained to make predictions ...

Quantifying Memorization Across Neural Language Models

Large language models (LMs) have been shown to memorize parts of their t...

An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

We study the compute-optimal trade-off between model and training data s...

Fusing Sentence Embeddings Into LSTM-based Autoregressive Language Models

Although masked language models are highly performant and widely adopted...
