Naive imputation implicitly regularizes high-dimensional linear models

01/31/2023
by Alexis Ayme, et al.

Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithm, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the d ≫ √n regime. Experiments illustrate our findings.
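
To make the pipeline described in the abstract concrete, here is a minimal sketch in Python with NumPy: entries are masked completely at random (MCAR), missing values are replaced by zero, and a single pass of averaged SGD is run on the zero-imputed data. The function names (`zero_impute`, `averaged_sgd`), the synthetic Gaussian data, the observation probability `rho`, and the constant step size `1 / (d * sqrt(n))` are all illustrative assumptions, not details taken from the abstract; the paper's exact procedure and constants may differ.

```python
import numpy as np


def zero_impute(X, mask):
    """Return a copy of X where unobserved entries (mask == False) are set to 0."""
    return np.where(mask, X, 0.0)


def averaged_sgd(X, y, step_size):
    """Single-pass SGD on the squared loss; returns the running average of the iterates."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for i in range(n):
        # Stochastic gradient of 0.5 * (x_i^T theta - y_i)^2 with respect to theta.
        grad = (X[i] @ theta - y[i]) * X[i]
        theta = theta - step_size * grad
        # Incremental mean of the iterates theta_1, ..., theta_{i+1}.
        theta_avg += (theta - theta_avg) / (i + 1)
    return theta_avg


# Synthetic setup (assumed for illustration only).
rng = np.random.default_rng(0)
n, d, rho = 500, 200, 0.7  # samples, features, probability that an entry is observed
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d) / np.sqrt(d)
y = X @ beta + 0.1 * rng.standard_normal(n)

# MCAR mask: each entry is observed independently with probability rho.
mask = rng.random((n, d)) < rho
X_imp = zero_impute(X, mask)

# Constant step size; this particular scaling is an assumption of the sketch.
step_size = 1.0 / (d * np.sqrt(n))
theta_hat = averaged_sgd(X_imp, y, step_size)
```

At prediction time, a new incomplete input would be zero-imputed in the same way before taking its inner product with `theta_hat`.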

research
08/21/2018

An ensemble learning method for variable selection: application to high dimensional data and missing values

Standard approaches for variable selection in linear models are not tail...
research
11/06/2021

In Nonparametric and High-Dimensional Models, Bayesian Ignorability is an Informative Prior

In problems with large amounts of missing data one must model two distin...
research
05/13/2019

Multiple imputation using dimension reduction techniques for high-dimensional data

Missing data present challenges in data analysis. Naive analyses such as...
research
11/16/2022

The Missing Indicator Method: From Low to High Dimensions

Missing data is common in applied data science, particularly for tabular...
research
05/30/2022

Partial Replacement Imputation Estimation Method for Complex Missing Covariates in Additive Partially Linear Models

Missing data is a common problem in clinical data collection, which caus...
research
10/01/2020

When to Impute? Imputation before and during cross-validation

Cross-validation (CV) is a technique used to estimate generalization err...
research
12/14/2021

Navigating the corporate disclosure gap: Modelling of Missing Not at Random Carbon Data

Corporate carbon emissions data is disclosed by approximately 65 and mid...
