Naive imputation implicitly regularizes high-dimensional linear models

01/31/2023
by Alexis Ayme, et al.

Two different approaches exist to handle missing values for prediction: either imputation, prior to fitting any predictive algorithms, or dedicated methods able to natively incorporate missing values. While imputation is widely (and easily) used, it is unfortunately biased when low-capacity predictors (such as linear models) are applied afterward. However, in practice, naive imputation exhibits good predictive performance. In this paper, we study the impact of imputation in a high-dimensional linear model with MCAR missing data. We prove that zero imputation performs an implicit regularization closely related to the ridge method, often used in high-dimensional problems. Leveraging this connection, we establish that the imputation bias is controlled by a ridge bias, which vanishes in high dimension. As a predictor, we argue in favor of the averaged SGD strategy, applied to zero-imputed data. We establish an upper bound on its generalization error, highlighting that imputation is benign in the d ≫ √n regime. Experiments illustrate our findings.
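
The strategy named in the abstract, averaged SGD applied to zero-imputed data, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the Gaussian design, the constant step size, the single pass over the data, the Polyak-Ruppert averaging scheme, and the helper names (zero_impute, averaged_sgd, simulate, mse) are all assumptions made for the example. Entries are masked independently with observation probability rho, which is one simple way to simulate MCAR missingness.

```python
import random


def zero_impute(x, mask):
    """Replace unobserved entries (mask[j] is falsy) by 0."""
    return [xj if mj else 0.0 for xj, mj in zip(x, mask)]


def averaged_sgd(samples, dim, step):
    """Single-pass SGD on the squared loss; returns the running mean of the iterates."""
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for t, (x, y) in enumerate(samples, start=1):
        residual = sum(tj * xj for tj, xj in zip(theta, x)) - y
        theta = [tj - step * residual * xj for tj, xj in zip(theta, x)]
        # Update the mean of theta_1, ..., theta_t incrementally.
        theta_bar = [bj + (tj - bj) / t for bj, tj in zip(theta_bar, theta)]
    return theta_bar


def simulate(n=2000, dim=20, rho=0.7, noise=0.1, seed=0):
    """Draw n zero-imputed samples from a linear model (hypothetical setup).

    Inputs are standard Gaussian; each entry is observed independently
    with probability rho (MCAR); missing entries are set to 0.
    """
    rng = random.Random(seed)
    theta_star = [1.0 / dim ** 0.5] * dim  # unit-norm ground truth, an assumption
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = sum(a * b for a, b in zip(theta_star, x)) + noise * rng.gauss(0.0, 1.0)
        mask = [rng.random() < rho for _ in range(dim)]
        samples.append((zero_impute(x, mask), y))
    return samples, theta_star


def mse(theta, samples):
    """Mean squared prediction error of the linear predictor theta."""
    errors = [(sum(tj * xj for tj, xj in zip(theta, x)) - y) ** 2 for x, y in samples]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    dim = 20
    train, _ = simulate(dim=dim, seed=0)
    test, _ = simulate(dim=dim, seed=1)
    theta_bar = averaged_sgd(train, dim=dim, step=1.0 / (2 * dim))
    print("test MSE, averaged SGD on zero-imputed data:", mse(theta_bar, test))
    print("test MSE, always predicting 0:", mse([0.0] * dim, test))
```

The step size 1/(2·dim) is a conservative choice for this toy design, not a value taken from the paper. Evaluating on test data that is also zero-imputed matches the prediction setting described in the abstract, where missing values occur at test time as well.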

08/21/2018

An ensemble learning method for variable selection: application to high dimensional data and missing values

Standard approaches for variable selection in linear models are not tail...
11/06/2021

In Nonparametric and High-Dimensional Models, Bayesian Ignorability is an Informative Prior

In problems with large amounts of missing data one must model two distin...
05/13/2019

Multiple imputation using dimension reduction techniques for high-dimensional data

Missing data present challenges in data analysis. Naive analyses such as...
11/16/2022

The Missing Indicator Method: From Low to High Dimensions

Missing data is common in applied data science, particularly for tabular...
05/30/2022

Partial Replacement Imputation Estimation Method for Complex Missing Covariates in Additive Partially Linear Models

Missing data is a common problem in clinical data collection, which caus...
10/01/2020

When to Impute? Imputation before and during cross-validation

Cross-validation (CV) is a technique used to estimate generalization err...
12/14/2021

Navigating the corporate disclosure gap: Modelling of Missing Not at Random Carbon Data

Corporate carbon emissions data is disclosed by approximately 65 and mid...