The Missing Indicator Method: From Low to High Dimensions

by Mike Van Ness, et al.

Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods work only on complete data, so incomplete data sets require preprocessing, such as missing value imputation, before these methods can be applied. However, imputation discards potentially useful information encoded by the pattern of missing values. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds binary indicator variables that record which values were missing, can be used in conjunction with imputation to improve model performance. We show experimentally that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Nonetheless, MIM can increase variance if many of the added indicators are uninformative, causing harm particularly for high-dimensional data sets. To address this issue, we introduce Selective MIM (SMIM), a method that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM across a range of experimental settings, and improves on MIM for high-dimensional data.
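The idea of pairing imputation with missing indicators can be illustrated with a minimal sketch in plain Python. It assumes mean imputation and uses `None` to mark a missing value. The function names, the toy data, and in particular the selection rule in `select_informative_columns` (a simple threshold on the difference in mean target between rows where a feature is missing and rows where it is observed) are hypothetical stand-ins chosen for illustration. The abstract does not specify how SMIM decides which missing patterns are informative, so this rule should not be read as the authors' criterion.

```python
from statistics import mean


def mean_impute_with_indicators(rows, indicator_columns=None):
    """Mean-impute each column and append 0/1 missing indicators.

    rows: list of equal-length lists; None marks a missing value.
    indicator_columns: columns that get an indicator. If None, every
    column with at least one missing value gets one (plain MIM).
    """
    n_cols = len(rows[0])
    # Column means over observed values only (0.0 if nothing is observed).
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    if indicator_columns is None:
        indicator_columns = [
            j for j in range(n_cols) if any(r[j] is None for r in rows)
        ]
    augmented = []
    for r in rows:
        imputed = [col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
        indicators = [1 if r[j] is None else 0 for j in indicator_columns]
        augmented.append(imputed + indicators)
    return augmented


def select_informative_columns(rows, y, threshold=0.5):
    """Hypothetical selection rule (an assumption, not the paper's test):
    keep a column if the mean target differs between rows where the
    column is missing and rows where it is observed by more than threshold.
    """
    selected = []
    for j in range(len(rows[0])):
        y_missing = [t for r, t in zip(rows, y) if r[j] is None]
        y_observed = [t for r, t in zip(rows, y) if r[j] is not None]
        if y_missing and y_observed:
            if abs(mean(y_missing) - mean(y_observed)) > threshold:
                selected.append(j)
    return selected


# Toy data: column 1 is missing exactly when the target is high;
# missingness in column 0 is unrelated to the target.
X = [
    [1.0, None],
    [2.0, 5.0],
    [None, 7.0],
    [None, None],
    [3.0, None],
    [6.0, 3.0],
]
y = [10.0, 0.0, 0.0, 10.0, 10.0, 0.0]

X_mim = mean_impute_with_indicators(X)  # indicators for both columns
informative = select_informative_columns(X, y)
X_selective = mean_impute_with_indicators(X, indicator_columns=informative)
```

In this toy example the plain variant appends two indicator columns, while the selective variant appends only the indicator for column 1, which is the trade-off the abstract describes: fewer uninformative indicators means fewer extra dimensions for the downstream model to fit.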


