
An Empirical Exploration in Quality Filtering of Text Data

by Leo Gao, et al.

While conventional wisdom suggests that more aggressively filtering data from low-quality sources such as Common Crawl always monotonically improves the quality of training data, we find that aggressive filtering can in fact decrease model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective, suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work motivates detailed analysis of the effects of dataset filtering design choices on downstream model performance in future work.
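
As a rough sketch of the setup described above (filtering documents by a proxy quality score, with a threshold controlling how aggressively the filter discards data), the snippet below illustrates the idea. The filter_by_quality helper, the scores, and the example documents are hypothetical placeholders for whatever classifier and cutoff a real pipeline would use; they are not the paper's actual method.

```python
# Minimal sketch of threshold-based quality filtering (hypothetical scores;
# not the classifier or thresholds used in the paper).
from typing import Iterable, List, Tuple


def filter_by_quality(docs: Iterable[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep documents whose proxy quality score is at least `threshold`.

    Raising the threshold corresponds to more aggressive filtering: more of
    the corpus is discarded in exchange for a higher average proxy score,
    which (per the abstract) does not necessarily improve downstream quality.
    """
    return [text for text, score in docs if score >= threshold]


if __name__ == "__main__":
    corpus = [
        ("A well-edited, encyclopedia-style paragraph.", 0.92),
        ("Boilerplate navigation text from a scraped page.", 0.15),
        ("An informal but informative forum answer.", 0.55),
    ]
    for threshold in (0.1, 0.5, 0.9):
        kept = filter_by_quality(corpus, threshold)
        print(f"threshold={threshold:.1f}: kept {len(kept)}/{len(corpus)} documents")
```

In this framing, the paper's finding is that pushing the threshold too high optimizes the proxy score at the expense of performance on the true training objective.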

