Quality of Data in Machine Learning

by Antti Kariluoto, et al.

It is commonly assumed that machine learning models improve their performance when they have more data to learn from. In this study, the authors test this assumption with an empirical experiment on novel vocational student data. The experiment compared different machine learning algorithms while varying the number of data records and the feature combinations available for training and testing the models. It revealed that increasing the number of data records or their sampling frequency does not immediately lead to significant increases in model accuracy or performance; however, the variance of the accuracies does diminish in the case of ensemble models. A similar phenomenon was observed when increasing the number of input features for the models. The study refutes the starting assumption and concludes that, in this case, the value of the data lies in its quality rather than its quantity.
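
The kind of experiment described above (training on progressively more records and tracking both the mean and the spread of accuracy) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic two-class Gaussian samples rather than the vocational student data, the model is a simple nearest-centroid classifier rather than the algorithms the authors compared, and the names `make_data`, `fit_centroids`, `predict`, `accuracy`, and `run_experiment` are hypothetical.

```python
import random
import statistics


def make_data(n, rng, n_features=2, noise=1.0):
    """Synthetic data (assumption): class k is centred at k in every feature."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        features = [rng.gauss(float(label), noise) for _ in range(n_features)]
        data.append((features, label))
    return data


def fit_centroids(train):
    """Nearest-centroid 'training': average the feature vectors of each class."""
    grouped = {}
    for features, label in train:
        grouped.setdefault(label, []).append(features)
    return {
        label: [statistics.fmean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(features, centroids[label])),
    )


def accuracy(centroids, test):
    """Fraction of test records classified correctly."""
    hits = sum(predict(centroids, features) == label for features, label in test)
    return hits / len(test)


def run_experiment(train_sizes=(20, 100, 500), repeats=20, test_size=500, seed=0):
    """For each training size, repeat the fit and record mean and variance of accuracy."""
    rng = random.Random(seed)
    results = {}
    for size in train_sizes:
        scores = []
        for _ in range(repeats):
            train = make_data(size, rng)
            test = make_data(test_size, rng)
            scores.append(accuracy(fit_centroids(train), test))
        results[size] = (statistics.fmean(scores), statistics.pvariance(scores))
    return results


if __name__ == "__main__":
    for size, (mean_acc, var_acc) in run_experiment().items():
        print(f"train size {size}: mean accuracy {mean_acc:.3f}, variance {var_acc:.5f}")
```

Comparing the reported mean and variance across training sizes is one way to check whether additional records change accuracy itself or only its stability; the outcome on this synthetic data says nothing about the results reported in the study.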



Experimental Design for Bathymetry Editing

We describe an application of machine learning to a real-world computer ...

The Impact of Feature Quantity on Recommendation Algorithm Performance: A Movielens-100K Case Study

Recent model-based Recommender Systems (RecSys) algorithms emphasize on ...

A Data Quality-Driven View of MLOps

Developing machine learning models can be seen as a process similar to t...

Machine Learning for Antimicrobial Resistance

Biological datasets amenable to applied machine learning are more availa...

Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing

This paper demonstrates the systematic use of combinatorial coverage for...

Automatic Generation of Synthetic Colonoscopy Videos for Domain Randomization

An increasing number of colonoscopic guidance and assistance systems rel...
