Generating the Ground Truth: Synthetic Data for Label Noise Research

by Sjoerd de Vries, et al.

Most real-world classification tasks suffer from label noise to some extent. Such noise adversely affects the generalization error of learned models and complicates the evaluation of noise-handling methods, as their performance cannot be accurately measured without clean labels. In label noise research, the baseline is typically either noisy data or simple simulated data, into which additional noise with known properties is injected. In this paper, we propose SYNLABEL, a framework that aims to improve upon these methodologies. It allows a noiseless dataset informed by real data to be created, by either pre-specifying or learning a function and defining it as the ground truth function from which labels are generated. Furthermore, by resampling a number of values for selected features in the function domain, evaluating the function, and aggregating the resulting labels, each data point can be assigned a soft label, that is, a label distribution. Such distributions allow for direct injection and quantification of label noise. The generated datasets serve as a clean baseline of adjustable complexity into which different types of noise may be introduced. We illustrate how the framework can be applied, how it enables quantification of label noise, and how it improves over existing methodologies.
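The resampling step described in the abstract can be sketched in a few lines. This is only an illustrative toy, not the SYNLABEL implementation: the function `ground_truth`, the two-feature data, the choice to resample the second feature from its empirical marginal, and the sample counts are all assumptions made for the example.

```python
import random


def ground_truth(x1, x2):
    """Hypothetical pre-specified ground truth function (deterministic)."""
    return 1 if x1 + x2 > 0 else 0


def soft_label(x1, x2_pool, n_samples, rng):
    """Resample the selected feature x2, evaluate the function, and
    aggregate the resulting hard labels into a label distribution."""
    counts = [0, 0]
    for _ in range(n_samples):
        x2 = rng.choice(x2_pool)  # assumption: draw from the empirical marginal
        counts[ground_truth(x1, x2)] += 1
    return [c / n_samples for c in counts]


rng = random.Random(0)

# Toy feature data standing in for a real dataset.
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Noiseless hard labels: the function output is the ground truth by definition.
hard_labels = [ground_truth(x1, x2) for x1, x2 in data]

# Soft labels: treat x2 as the selected feature and resample it per data point.
x2_pool = [x2 for _, x2 in data]
soft_labels = [soft_label(x1, x2_pool, 100, rng) for x1, _ in data]
```

Each entry of `soft_labels` is a two-class distribution for one data point; points whose `x1` lies far from the decision boundary receive distributions close to one-hot, while points near it receive more uniform ones.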




BadLabel: A Robust Perspective on Evaluating and Enhancing Label-noise Learning

Label-noise learning (LNL) aims to increase the model's generalization g...

Beyond Hard Labels: Investigating data label distributions

High-quality data is a key aspect of modern machine learning. However, l...

Who Decides if AI is Fair? The Labels Problem in Algorithmic Auditing

Labelled "ground truth" datasets are routinely used to evaluate and audi...

Supervised Learning in the Presence of Noise: Application in ICD-10 Code Classification

ICD coding is the international standard for capturing and reporting hea...

Mitigating Label Noise through Data Ambiguation

Label noise poses an important challenge in machine learning, especially...

Improving Label Quality by Jointly Modeling Items and Annotators

We propose a fully Bayesian framework for learning ground truth labels f...

NoiseRank: Unsupervised Label Noise Reduction with Dependence Models

Label noise is increasingly prevalent in datasets acquired from noisy ch...
