On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model

by   Seongjin Shin, et al.

Many recent studies on large-scale language models have reported successful in-context zero- and few-shot learning ability. However, the in-depth analysis of when in-context learning occurs is still lacking. For example, it is unknown how in-context learning performance changes as the training corpus varies. Here, we investigate the effects of the source and size of the pretraining corpus on in-context learning in HyperCLOVA, a Korean-centric GPT-3 model. From our in-depth investigation, we introduce the following observations: (1) in-context learning performance heavily depends on the corpus domain source, and the size of the pretraining corpus does not necessarily determine the emergence of in-context learning, (2) in-context learning ability can emerge when a language model is trained on a combination of multiple corpora, even when each corpus does not result in in-context learning on its own, (3) pretraining with a corpus related to a downstream task does not always guarantee the competitive in-context learning performance of the downstream task, especially in the few-shot setting, and (4) the relationship between language modeling (measured in perplexity) and in-context learning does not always correlate: e.g., low perplexity does not always imply high in-context few-shot learning performance.


What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers

GPT-3 shows remarkable in-context learning ability of large-scale langua...

An Explanation of In-context Learning as Implicit Bayesian Inference

Large pretrained language models such as GPT-3 have the surprising abili...

Improving Large-scale Language Models and Resources for Filipino

In this paper, we improve on existing language resources for the low-res...

ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data

Large pretrained language models have been performing increasingly well ...

Continued Pretraining for Better Zero- and Few-Shot Promptability

Recently introduced language model prompting methods can achieve high ac...

FLEX: Unifying Evaluation for Few-Shot NLP

Few-shot NLP research is highly active, yet conducted in disjoint resear...

Cramming: Training a Language Model on a Single GPU in One Day

Recent trends in language modeling have focused on increasing performanc...

Please sign up or login with your details

Forgot password? Click here to reset