When Do You Need Billions of Words of Pretraining Data?

11/10/2020
by Yian Zhang et al.

NLP is currently dominated by general-purpose pretrained language models like RoBERTa, which achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? We adopt four probing methods (classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks) and draw learning curves that track how these different measures of linguistic ability grow with pretraining data volume, using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M, and 1B words. We find that LMs require only about 10M or 100M words to learn representations that reliably encode most of the syntactic and semantic features we test. A much larger quantity of data is needed to acquire enough commonsense knowledge and the other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, other forms of knowledge are likely the major drivers of recent improvements in language understanding among large pretrained models.
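As a concrete illustration of the first of these methods, the sketch below shows what classifier probing looks like in practice: the pretrained encoder is frozen and a single linear layer is trained to read a linguistic feature off its representations. This is a minimal PyTorch sketch, assuming the Hugging Face transformers library and a MiniBERTa checkpoint published under nyu-mll on the model hub (the checkpoint name is an assumption); the two-sentence tense task is a toy stand-in for the paper's probing suites, not its exact setup.

    # Classifier probing sketch: freeze the encoder, train only a linear probe.
    import torch
    from transformers import AutoModel, AutoTokenizer

    checkpoint = "nyu-mll/roberta-base-100M-1"  # assumed MiniBERTa hub name
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    encoder = AutoModel.from_pretrained(checkpoint)
    encoder.eval()  # representations stay frozen; only the probe learns

    sentences = ["The dog barked.", "The dog barks."]
    labels = torch.tensor([0, 1])  # toy feature: 0 = past tense, 1 = present

    with torch.no_grad():
        batch = tokenizer(sentences, return_tensors="pt", padding=True)
        features = encoder(**batch).last_hidden_state[:, 0]  # <s> token vector

    probe = torch.nn.Linear(features.size(-1), 2)  # the probe: one linear layer
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(100):  # fit the probe on the frozen features
        optimizer.zero_grad()
        loss = loss_fn(probe(features), labels)
        loss.backward()
        optimizer.step()

Repeating this for each MiniBERTa and plotting held-out probe accuracy against pretraining corpus size (1M through 1B words) yields a learning curve of the kind described above; the other three methods vary what is measured and whether the encoder itself is updated.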


