CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]

by Peng Li, et al.

It is widely recognized that data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there is no rigorous study of exactly how cleaning affects ML: the ML community usually focuses on the effects of specific types of noise with certain distributions (e.g., mislabels) on certain ML models, while the database (DB) community has mostly studied data cleaning in isolation, without considering how data is consumed by downstream analytics. We propose the CleanML benchmark, which systematically investigates the impact of data cleaning on downstream ML models. The CleanML benchmark currently includes 13 real-world datasets with real errors, five common error types, and seven different ML models. To ensure that our findings are statistically significant, CleanML carefully controls the randomness in ML experiments using statistical hypothesis testing, and also uses the Benjamini-Yekutieli (BY) procedure to control potential false discoveries due to the many hypotheses in the benchmark. We obtain many interesting and non-trivial insights, and identify multiple open research directions. We also release the benchmark and hope to invite future studies on the important problems of joint data cleaning and ML.
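
To illustrate the false discovery control mentioned above, here is a minimal sketch of the standard Benjamini-Yekutieli step-up procedure in plain Python. It is not taken from the CleanML code base: the function name `benjamini_yekutieli`, the default `alpha`, and the example p-values are assumptions made for illustration, and the abstract does not say which hypothesis test produces the p-values.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True if it is rejected under BY.

    Hypothetical helper for illustration; not part of CleanML.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction c(m) = sum_{i=1..m} 1/i, which keeps the procedure
    # valid under arbitrary dependence between the tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            max_k = rank
    # Reject every hypothesis ranked at or below max_k.
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


# Made-up p-values, e.g. one per (dataset, error type, model) comparison.
example_p_values = [0.001, 0.04, 0.03, 0.5]
decisions = benjamini_yekutieli(example_p_values, alpha=0.05)
```

Because the rule is step-up, a hypothesis whose p-value misses its own threshold can still be rejected when a higher-ranked p-value meets its threshold. The factor c(m) makes BY more conservative than the Benjamini-Hochberg procedure.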




