Exploiting Class Learnability in Noisy Data

11/15/2018
by Matthew Klawonn, et al.

In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested this way, sometimes producing entire classes of data on which learned classifiers generalize poorly. For real-world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we explore the classes in a given data set and guide supervised training to spend time on each class in proportion to its learnability. By focusing the training process in this way, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with a classifier and its training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show that our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.
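The abstract does not state the selection rule, so the following is only a minimal sketch of the general idea of class-proportional sampling, not the authors' algorithm. It assumes per-class accuracy on held-out data as a stand-in for "learnability"; the names `update_class_weights` and `sample_batch` and the `step` parameter are hypothetical.

```python
import random


def update_class_weights(weights, val_accuracy, step=0.5):
    """Blend each class weight with its held-out accuracy, then renormalize.

    Assumption: held-out accuracy is used as a proxy for how learnable a
    class is. The paper's actual update rule is not given in the abstract.
    """
    mixed = {c: (1 - step) * weights[c] + step * val_accuracy[c] for c in weights}
    total = sum(mixed.values())
    if total == 0:
        # No class shows any signal: fall back to uniform sampling.
        return {c: 1.0 / len(mixed) for c in mixed}
    return {c: w / total for c, w in mixed.items()}


def sample_batch(examples_by_class, weights, batch_size, rng):
    """Draw a training batch whose class mix follows the current weights."""
    classes = list(weights)
    probs = [weights[c] for c in classes]
    chosen = rng.choices(classes, weights=probs, k=batch_size)
    # Return (example, label) pairs; examples are sampled with replacement.
    return [(rng.choice(examples_by_class[c]), c) for c in chosen]


if __name__ == "__main__":
    rng = random.Random(0)
    data = {"clean": ["a1", "a2", "a3"], "noisy": ["b1", "b2", "b3"]}
    weights = {"clean": 0.5, "noisy": 0.5}
    # Hypothetical held-out accuracies after one round of training.
    weights = update_class_weights(weights, {"clean": 0.9, "noisy": 0.1})
    batch = sample_batch(data, weights, batch_size=8, rng=rng)
    print(weights, batch)
```

In a full training loop, these two steps would alternate with classifier updates, so that classes on which the model keeps generalizing poorly receive a shrinking share of each batch.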


