BoostClean: Automated Error Detection and Repair for Machine Learning

by Sanjay Krishnan, et al.

Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined from a variety of different sources, each susceptible to different types of inconsistencies, and as new data stream in at prediction time, the model may encounter previously unseen inconsistencies. An important class of such inconsistencies is domain value violations, which occur when an attribute value is outside of an allowed domain. We explore automatically detecting and repairing such violations by leveraging the often available clean test labels to determine whether a given detection and repair combination will improve model accuracy. We present BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting. BoostClean selects this ensemble from an extensible library that is pre-populated with general detection functions, including a novel detector based on the Word2Vec deep learning model, which detects errors across a diverse set of domains. Our evaluation on a collection of 12 datasets from Kaggle, the UCI repository, real-world data analyses, and production datasets shows that BoostClean can increase absolute prediction accuracy by up to 9%. Optimizations including parallelism, materialization, and indexing techniques show a 22.2x end-to-end speedup on a 16-core machine.
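The core idea of scoring detection and repair combinations against clean test labels can be illustrated with a minimal sketch. This is not BoostClean itself: it exhaustively scores every pair and keeps the single best one rather than building a boosted ensemble. The detectors, repairs, nearest-centroid classifier, and toy data are all hypothetical stand-ins chosen for illustration.

```python
from statistics import mean

# Hypothetical detectors: each flags a value outside an assumed allowed domain.
def detect_negative(x):
    return x < 0

def detect_none(x):
    return False

# Hypothetical repairs: each computes a fill value from the unflagged values.
def repair_mean(clean_values):
    return mean(clean_values)

def repair_zero(clean_values):
    return 0.0

def clean(values, detect, repair):
    """Replace every flagged value with the repair's fill value."""
    clean_values = [v for v in values if not detect(v)]
    fill = repair(clean_values) if clean_values else 0.0
    return [fill if detect(v) else v for v in values]

def train_centroids(xs, ys):
    """Toy nearest-centroid classifier on a single numeric feature."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}

def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def accuracy(centroids, xs, ys):
    return sum(predict(centroids, x) == y for x, y in zip(xs, ys)) / len(ys)

def select_best(train_x, train_y, test_x, test_y, detectors, repairs):
    """Score each (detector, repair) pair on the clean-labeled test set."""
    scores = {}
    for dname, detect in detectors.items():
        for rname, repair in repairs.items():
            model = train_centroids(clean(train_x, detect, repair), train_y)
            scores[(dname, rname)] = accuracy(
                model, clean(test_x, detect, repair), test_y)
    best = max(scores, key=scores.get)
    return best, scores

# Toy data: -999 is an assumed sentinel that violates the feature's domain.
train_x = [1, 2, 3, -999, 9, 10, 8, -999]
train_y = [0, 0, 0, 1, 1, 1, 1, 1]
test_x = [1, 2, 9, 10]
test_y = [0, 0, 1, 1]

detectors = {"negative": detect_negative, "none": detect_none}
repairs = {"mean": repair_mean, "zero": repair_zero}

best, scores = select_best(train_x, train_y, test_x, test_y, detectors, repairs)
print(best, scores)
```

In this toy setup, leaving the sentinel values in place drags the class-1 centroid far from the real data, so the pairs that use the `negative` detector score higher on the test labels. A boosting-based selector, as the abstract describes, would instead combine several such pairs into an ensemble.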




