An Empirical Exploration in Quality Filtering of Text Data

09/02/2021
by Leo Gao, et al.

While conventional wisdom suggests that more aggressively filtering data from low-quality sources like Common Crawl monotonically improves the quality of training data, we find that aggressive filtering can in fact lead to a decrease in model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective (an instance of Goodhart's law), suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work motivates detailed future analysis of how dataset filtering design choices affect downstream model performance.
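To make the kind of filtering pipeline under discussion concrete, below is a minimal Python sketch of classifier-based quality filtering with an adjustable aggressiveness knob. The `quality_score` heuristic here is a hypothetical stand-in for a learned quality classifier (a real pipeline would train a model to score how closely a document resembles a high-quality reference corpus), and the stochastic keep rule follows the Pareto-thresholding scheme described in the appendix of the GPT-3 paper; neither is claimed to be this paper's exact setup.

```python
import random

def quality_score(doc: str) -> float:
    """Hypothetical stand-in for a learned quality classifier.

    A real pipeline would use a trained model (e.g., a linear classifier
    over n-gram features) scoring how much a document resembles a
    high-quality reference corpus. This toy heuristic simply rewards
    longer documents with a more diverse vocabulary; it returns a score
    in [0, 1], where higher means "looks more like high-quality text".
    """
    words = doc.split()
    if not words:
        return 0.0
    diversity = len(set(words)) / len(words)
    length_bonus = min(len(words) / 100.0, 1.0)
    return 0.5 * diversity + 0.5 * length_bonus

def filter_corpus(docs, alpha):
    """Stochastically keep documents, biased toward high scores.

    Follows the Pareto-thresholding rule described in the GPT-3 paper's
    appendix: keep a document when a Pareto-distributed draw exceeds
    (1 - score). random.paretovariate(alpha) samples on [1, inf), so
    subtracting 1 gives the [0, inf)-supported draw used there. Larger
    alpha concentrates the draws near 0, so filtering grows more
    aggressive as alpha increases.
    """
    return [doc for doc in docs
            if random.paretovariate(alpha) - 1.0 > 1.0 - quality_score(doc)]

if __name__ == "__main__":
    corpus = ["some scraped web text ..."] * 1000  # placeholder documents
    for alpha in (1, 9, 100):  # increasing filtering aggressiveness
        kept = filter_corpus(corpus, alpha)
        print(f"alpha={alpha}: kept {len(kept)}/{len(corpus)} documents")
```

The abstract's finding is easy to read off this framing: the classifier score is only a proxy for a document's usefulness as training data, so raising alpha optimizes the proxy ever harder and, past some point, can select a narrower and ultimately less useful distribution of text.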

