
Naive imputation implicitly regularizes high-dimensional linear models

by Alexis Ayme, et al.

Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithm, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the d ≫ √n regime. Experiments illustrate our findings.
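To make the pipeline described in the abstract concrete, here is a minimal sketch of zero imputation followed by single-pass averaged SGD on a least-squares objective. It is an illustration under stated assumptions, not the authors' implementation: the synthetic Gaussian data, the function names (`make_mcar_data`, `default_step_size`, `averaged_sgd`), the observation probability, and the step-size rule are all hypothetical choices made for this example, and the abstract does not specify any of them.

```python
import numpy as np


def make_mcar_data(n, d, observe_prob, rng):
    """Synthetic linear data with MCAR missingness, then zero imputation.

    Assumption: isotropic Gaussian features and a Gaussian noise level of 0.1;
    neither comes from the abstract.
    """
    X = rng.standard_normal((n, d))
    beta = rng.standard_normal(d) / np.sqrt(d)
    y = X @ beta + 0.1 * rng.standard_normal(n)
    # MCAR: each entry is observed independently with probability observe_prob.
    observed = rng.random((n, d)) < observe_prob
    # Zero imputation: every missing entry is replaced by 0.
    X_imputed = np.where(observed, X, 0.0)
    return X_imputed, y


def default_step_size(X):
    """Assumed step-size rule 1 / (2 * max squared row norm), chosen for stability."""
    return 1.0 / (2.0 * np.max(np.sum(X ** 2, axis=1)))


def averaged_sgd(X, y, step_size):
    """One pass of SGD on the squared loss, returning the average of the iterates."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t in range(n):
        x_t, y_t = X[t], y[t]
        grad = (x_t @ theta - y_t) * x_t      # gradient of 0.5 * (x.theta - y)^2
        theta = theta - step_size * grad
        theta_bar += (theta - theta_bar) / (t + 1)  # running mean of iterates
    return theta_bar


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_imp, y = make_mcar_data(n=2000, d=200, observe_prob=0.7, rng=rng)
    theta_bar = averaged_sgd(X_imp, y, default_step_size(X_imp))
    mse = np.mean((X_imp @ theta_bar - y) ** 2)
    print(f"in-sample MSE of averaged SGD on zero-imputed data: {mse:.3f}")
```

The predictor is fit directly on the zero-imputed design matrix, with no missingness indicators and no explicit penalty; any regularization comes from the imputation itself, which is the effect the abstract relates to ridge.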




An ensemble learning method for variable selection: application to high dimensional data and missing values

Standard approaches for variable selection in linear models are not tail...

In Nonparametric and High-Dimensional Models, Bayesian Ignorability is an Informative Prior

In problems with large amounts of missing data one must model two distin...

Multiple imputation using dimension reduction techniques for high-dimensional data

Missing data present challenges in data analysis. Naive analyses such as...

The Missing Indicator Method: From Low to High Dimensions

Missing data is common in applied data science, particularly for tabular...

Partial Replacement Imputation Estimation Method for Complex Missing Covariates in Additive Partially Linear Models

Missing data is a common problem in clinical data collection, which caus...

When to Impute? Imputation before and during cross-validation

Cross-validation (CV) is a technique used to estimate generalization err...

Navigating the corporate disclosure gap: Modelling of Missing Not at Random Carbon Data

Corporate carbon emissions data is disclosed by approximately 65 and mid...