Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

by Curtis G. Northcutt et al.

We algorithmically identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of 3.4% errors across the 10 datasets, where for example label errors comprise 6% of the ImageNet validation set. Putative label errors are found using confident learning and then human-validated via crowdsourcing (54% of the algorithmically-flagged candidates are indeed erroneously labeled). Surprisingly, we find that lower capacity models may be practically more useful than higher capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels: ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels: VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by 5%. Traditionally, machine learning practitioners choose which model to deploy based on test accuracy – our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets.
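To make the flagging step concrete, here is a minimal sketch of the confident-learning idea in plain NumPy, not the authors' implementation (the paper's full method is available in the `cleanlab` library): compute each class's average self-confidence as a threshold, then flag any example whose out-of-sample predicted probabilities point to a different class that clears its threshold. The function name and toy data below are illustrative assumptions.

```python
import numpy as np

def find_label_issues(labels, pred_probs):
    """Flag likely label errors via a simplified confident-learning rule.

    labels: (n,) int array of given (possibly noisy) class labels.
    pred_probs: (n, k) array of out-of-sample predicted class probabilities.
    Returns indices of examples whose probable true class differs from
    the given label.
    """
    n, k = pred_probs.shape
    # t[j] = mean predicted probability of class j over examples labeled j
    # (the class's average "self-confidence")
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    issues = []
    for i in range(n):
        # classes whose predicted probability clears their own threshold
        above = [j for j in range(k) if pred_probs[i, j] >= thresholds[j]]
        if above:
            guess = max(above, key=lambda j: pred_probs[i, j])
            if guess != labels[i]:
                issues.append(i)
    return np.array(issues, dtype=int)

# Toy usage: examples 1 and 3 look mislabeled given the model's confidence.
labels = np.array([0, 0, 1, 1])
pred_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.1, 0.9],
                       [0.8, 0.2]])
print(find_label_issues(labels, pred_probs))  # → [1 3]
```

The flagged candidates would then be sent to human reviewers, mirroring the crowdsourced validation step described in the abstract.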




Identifying Label Errors in Object Detection Datasets by Loss Inspection

Labeling datasets for supervised object detection is a dull and time-con...

Learning with Bounded Instance- and Label-dependent Label Noise

Instance- and label-dependent label noise (ILN) widely exists in rea...

GMM Discriminant Analysis with Noisy Label for Each Class

Real world datasets often contain noisy labels, and learning from such d...

Channel-Wise Contrastive Learning for Learning with Noisy Labels

In real-world datasets, noisy labels are pervasive. The challenge of lea...

Identifying Incorrect Annotations in Multi-Label Classification Data

In multi-label classification, each example in a dataset may be annotate...

Does progress on ImageNet transfer to real-world datasets?

Does progress on ImageNet transfer to real-world datasets? We investigat...

Evaluating Bayes Error Estimators on Real-World Datasets with FeeBee

The Bayes error rate (BER) is a fundamental concept in machine learning ...
