Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

by Xiao Wang, et al.

Pretrained language models have achieved remarkable success in various natural language processing tasks. However, pretraining has recently shifted toward larger models and larger data, resulting in significant computational and energy costs. In this paper, we propose Influential Subset Selection (ISS) for language models, which explicitly utilizes end-task knowledge to select a tiny subset of the pretraining corpus. Specifically, ISS selects the samples that will provide the most positive influence on the performance of the end task. Furthermore, we design a gradient-matching-based influence estimation method, which drastically reduces the computation time of influence estimation. With only 0.45% of the computational cost, ISS outperformed pretrained models (e.g., RoBERTa) on eight datasets covering four domains.
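The core idea can be sketched in a few lines: approximate each candidate pretraining sample's influence on the end task by how well its training gradient aligns with the end-task gradient, then keep the top-scoring samples. This is a hypothetical simplification for illustration (dot-product alignment over flattened gradient vectors), not the paper's exact estimator; all names below are made up.

```python
import numpy as np

def influence_scores(candidate_grads: np.ndarray, end_task_grad: np.ndarray) -> np.ndarray:
    # Gradient-matching proxy for influence: a sample whose training
    # gradient points in the same direction as the end-task gradient
    # is expected to help the end task when used for pretraining.
    return candidate_grads @ end_task_grad

def select_subset(candidate_grads: np.ndarray, end_task_grad: np.ndarray, k: int) -> np.ndarray:
    scores = influence_scores(candidate_grads, end_task_grad)
    # Keep the k samples with the most positive estimated influence.
    return np.argsort(scores)[::-1][:k]

# Toy example: 5 candidate samples with 3-dimensional gradients.
grads = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [-1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
])
task_grad = np.array([1.0, 0.0, 0.0])
idx = select_subset(grads, task_grad, k=2)
print(idx.tolist())  # the two candidates whose gradients best match the end task
```

In practice the gradients would come from a language model's loss via automatic differentiation, and the dot product would be computed in mini-batches to keep memory bounded.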



