Automatically detecting data drift in machine learning classifiers

11/10/2021
by   Samuel Ackerman, et al.

Classifiers and other statistics-based machine learning (ML) techniques generalize, or learn, based on various statistical properties of the training data. The assumption underlying statistical ML, which yields theoretical or empirical performance guarantees, is that the distribution of the training data is representative of the production data distribution. This assumption often breaks; for instance, the statistical distributions of the data may change. We term changes that affect ML performance "data drift" or simply "drift". Many classification techniques compute a measure of confidence in their results. This measure might not reflect the actual ML performance. A famous example is the panda picture that is correctly classified as such with a confidence of about 60%, but when noise is added it is incorrectly classified as a gibbon with a confidence of above 99%. However, the work we report on here suggests that a classifier's measure of confidence can be used for the purpose of detecting data drift. We propose an approach, based solely on the classifier's suggested labels and its confidence in them, for alerting on data distribution or feature space changes that are likely to cause data drift. Our approach identifies degradation in model performance and does not require labeling of data in production, which is often lacking or delayed. Our experiments with three different data sets and classifiers demonstrate the effectiveness of this approach in detecting data drift. This is especially encouraging as the classification itself may or may not be correct and no model input data is required. We further explore the statistical approach of sequential change-point tests to automatically determine the amount of data needed in order to identify drift while controlling the false positive rate (Type-1 error).
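The core idea above, comparing the distribution of classifier confidence scores in production against a reference (training-time) distribution, can be illustrated with a minimal sketch. This is not the authors' exact procedure; it stands in a two-sample Kolmogorov-Smirnov test for the distribution comparison, and the window sizes, threshold, and `detect_confidence_drift` helper are assumptions for illustration only:

```python
import numpy as np
from scipy import stats

def detect_confidence_drift(ref_conf, prod_conf, alpha=0.05):
    """Flag drift when the production confidence distribution differs
    significantly from the reference distribution.

    ref_conf  -- classifier confidence scores on held-out training-era data
    prod_conf -- classifier confidence scores on a recent production window
    alpha     -- significance level (controls the per-test Type-1 error)
    """
    # Two-sample KS test: compares the empirical CDFs of the two windows.
    statistic, p_value = stats.ks_2samp(ref_conf, prod_conf)
    return p_value < alpha, p_value

rng = np.random.default_rng(0)
# Reference window: confidences concentrated near 0.9 (synthetic).
ref = np.clip(rng.normal(0.9, 0.05, 1000), 0.0, 1.0)
# Drifted production window: confidences degrade toward 0.6 (synthetic).
drifted = np.clip(rng.normal(0.6, 0.15, 1000), 0.0, 1.0)

drift, p = detect_confidence_drift(ref, drifted)
print(drift)  # True: the confidence distributions clearly differ
```

Note that only predicted labels and confidences are consumed, never the raw input features or true labels, matching the setting described in the abstract. A sequential variant, as the abstract discusses, would instead apply a change-point test (e.g. CUSUM-style accumulation) over a growing stream of confidences so that the detection delay and false positive rate can be controlled jointly.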


