Distances with mixed type variables: some modified Gower's coefficients

by Marcello D'Orazio, et al.

Nearest neighbor methods have become popular in official statistics, mainly for imputation and statistical matching; they also play a key role in machine learning, where a large number of variants have been proposed. The choice of distance function depends mainly on the type of the selected variables. Unfortunately, relatively few options can handle mixed-type variables, a situation frequently encountered in official statistics. The most popular distance for mixed-type variables is derived as the complement of Gower's similarity coefficient; it is appealing because it ranges between 0 and 1 and can handle missing values. Unfortunately, in the standard unweighted setting, the contributions of the individual variables to the overall Gower's distance are unbalanced because of the different nature of the variables themselves. This article tries to address the main drawbacks of the overall unweighted Gower's distance by suggesting some modifications to how the distance is calculated on interval- and ratio-scaled variables. Simple modifications try to attenuate the impact of outliers on the scaled Manhattan distance; other modifications, relying on kernel density estimation methods, attempt to reduce the unbalanced contribution of the different types of variables. The performance of the proposals is evaluated in simulations mimicking the imputation of missing values through the nearest neighbor distance hot-deck method.
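To make the starting point concrete, the following is a minimal Python sketch of the standard unweighted Gower's distance that the abstract refers to, not of the modified coefficients proposed in the article. It assumes numeric variables are scaled by their range (the scaled Manhattan distance), categorical variables contribute a simple 0/1 mismatch, and variables missing in either record are skipped. The function name `gower_distance`, the `ranges` argument, and the use of `None` for missing values are illustrative choices, not taken from the article.

```python
import math

def gower_distance(x, y, ranges):
    """Unweighted Gower distance between two records (illustrative sketch).

    x, y   : sequences of equal length; None marks a missing value.
    ranges : per-variable range (max - min) for numeric variables,
             or None to flag a categorical variable.
    """
    total, used = 0.0, 0
    for xi, yi, r in zip(x, y, ranges):
        if xi is None or yi is None:
            continue  # variable missing in either record: skip it
        if r is None:
            d = 0.0 if xi == yi else 1.0  # categorical: simple mismatch
        else:
            # numeric: range-scaled Manhattan distance (0 if the range is 0)
            d = abs(xi - yi) / r if r > 0 else 0.0
        total += d
        used += 1
    # average over the variables that could be compared
    return total / used if used else math.nan

# Hypothetical records: age, income, region
a = (30, 2000.0, "north")
b = (40, 3000.0, "south")
ranges = (50, 4000.0, None)  # assumed ranges for the two numeric variables
print(gower_distance(a, b, ranges))
```

In this made-up example the two numeric variables contribute 0.2 and 0.25, while the categorical mismatch contributes a full 1, which hints at the unbalanced contributions the abstract describes.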




Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach

Numerical data imputation algorithms replace missing values by estimates...

Nearest neighbor ratio imputation with incomplete multi-nomial outcome in survey sampling

Nonresponse is a common problem in survey sampling. Appropriate treatmen...

Weak consistency of the 1-nearest neighbor measure with applications to missing data and covariate shift

When data is partially missing at random, imputation and importance weig...

Asymptotically Exact and Fast Gaussian Copula Models for Imputation of Mixed Data Types

Missing values with mixed data types is a common problem in a large numb...

Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier - A Review

The K-nearest neighbor (KNN) classifier is one of the simplest and most ...

Data Fusion for Joining Income and Consumption Information Using Different Donor-Recipient Distance Metrics

Data fusion describes the method of combining data from (at least) two i...

A general framework for implementing distances for categorical variables

The degree to which subjects differ from each other with respect to cert...
