On the importance of pre-training data volume for compact language models

10/08/2020
by Vincent Micheli, et al.

Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that, beyond a critically low amount of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
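
The abstract describes a two-stage pipeline: masked language model (MLM) pre-training of a compact BERT on a capped volume of French text, followed by fine-tuning on FQuAD. The sketch below, using the Hugging Face Transformers and Datasets libraries, illustrates the first stage; the model dimensions, the file name french_100mb.txt, and all hyperparameters are illustrative assumptions rather than the authors' exact setup.

```python
# Minimal sketch: MLM pre-training of a compact BERT-style model on a
# capped volume of French text. Sizes, paths, and hyperparameters are
# hypothetical; only the overall recipe follows the paper's description.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder tokenizer; in practice one would train a tokenizer on the
# same French corpus used for pre-training.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# A compact architecture, far smaller than BERT-base (hypothetical sizes).
config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertForMaskedLM(config)

# "french_100mb.txt" stands in for a 100 MB slice of raw French text;
# varying the size of this file reproduces the data-volume ablation.
dataset = load_dataset("text", data_files={"train": "french_100mb.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style objective: mask 15% of input tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="compact-bert-fr",
    per_device_train_batch_size=64,
    num_train_epochs=3,
    learning_rate=1e-4,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```

Fine-tuning the resulting checkpoint on FQuAD then follows the standard SQuAD-style question-answering recipe (e.g. with BertForQuestionAnswering), and the intermediate pre-training step the paper evaluates amounts to re-running the MLM loop above on the task-specific corpus before fine-tuning.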
