Who wins the Miss Contest for Imputation Methods? Our Vote for Miss BooPF

11/30/2017
by Burim Ramosaj, et al.

Missing data is an expected issue when large amounts of data are collected, and several imputation techniques have been proposed to tackle this problem. Besides classical approaches such as MICE, the application of Machine Learning techniques is tempting. Here, the recently proposed missForest imputation method has shown high imputation accuracy under the Missing (Completely) at Random scheme with various missing rates. At its core, it is based on a random forest for classification and regression, respectively. In this paper we study whether this approach can be enhanced further by other methods such as the stochastic gradient tree boosting method, the C5.0 algorithm or modified random forest procedures. In particular, other resampling strategies within the random forest protocol are suggested. In an extensive simulation study, we analyze their performance for continuous, categorical and mixed-type data. Therein, MissBooPF, a combination of the stochastic gradient tree boosting method with the parametrically bootstrapped random forest method, appeared to be promising. Finally, an empirical analysis focusing on credit information and Facebook data is conducted.
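The abstract refers to measuring imputation accuracy under a Missing Completely at Random scheme at various missing rates, but does not spell out the protocol. The following is a minimal sketch of how such an experiment could be set up, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`make_mcar`, `mean_impute`, `nrmse`), the synthetic data, the chosen missing rates, the use of a normalized root mean squared error as the accuracy measure, and the column-mean imputer, which is only a simple stand-in where a method such as missForest would be plugged in.

```python
import math
import random


def make_mcar(data, rate, rng):
    """Return a copy of `data` where each cell is set to None with probability `rate`.

    Cells are dropped independently of all values, which is the MCAR assumption.
    """
    return [[None if rng.random() < rate else x for x in row] for row in data]


def mean_impute(data):
    """Stand-in imputer (assumption): fill None with the observed column mean."""
    n_cols = len(data[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in data]


def nrmse(truth, masked, imputed):
    """Normalized RMSE computed only over the cells that were masked.

    Defined here as sqrt(mean squared error / variance of the true masked values).
    """
    true_vals, sq_errs = [], []
    for t_row, m_row, i_row in zip(truth, masked, imputed):
        for t, m, i in zip(t_row, m_row, i_row):
            if m is None:
                true_vals.append(t)
                sq_errs.append((i - t) ** 2)
    if not sq_errs:
        return 0.0  # nothing was missing, so there is no imputation error
    mean_t = sum(true_vals) / len(true_vals)
    var_t = sum((t - mean_t) ** 2 for t in true_vals) / len(true_vals)
    if var_t == 0:
        return float("nan")
    return math.sqrt((sum(sq_errs) / len(sq_errs)) / var_t)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic continuous data: 200 rows, 3 columns (illustrative only).
    complete = [[rng.gauss(0, 1), rng.gauss(5, 2), rng.gauss(-1, 0.5)]
                for _ in range(200)]
    for rate in (0.1, 0.2, 0.3):  # hypothetical missing rates
        masked = make_mcar(complete, rate, rng)
        imputed = mean_impute(masked)
        print(f"missing rate {rate:.1f}: NRMSE = {nrmse(complete, masked, imputed):.3f}")
```

To compare imputation methods in this setup, one would replace `mean_impute` with each candidate and repeat the loop over several random seeds; categorical variables would need a separate error measure, such as the proportion of falsely classified entries.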


