Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

04/01/2021
by   Florian Pargent, et al.

Because most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A frequently encountered problem is high-cardinality features, i.e. unordered categorical predictor variables with a large number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of those techniques on a subsequent algorithm's predictive performance and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies in combination with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbours, support vector machine) on datasets from regression, binary classification, and multiclass classification settings. Throughout our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditional encodings were not as effective. These either make unreasonable assumptions to map levels to integers (e.g. integer encoding) or reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding).
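
To make the idea of regularized target encoding concrete, here is a minimal sketch in plain Python. It is not the implementation benchmarked in the paper: it assumes a simple shrinkage scheme (each level's target mean is pulled towards the global mean by a hypothetical `smoothing` pseudo-count) combined with out-of-fold encoding of the training rows, so that no row is encoded using its own target value. The function names, the `smoothing` and `n_folds` parameters, and the round-robin fold assignment are all illustrative assumptions.

```python
from collections import defaultdict


def fit_target_encoding(levels, targets, smoothing=10.0):
    """Learn a shrunken target mean per level; returns (mapping, prior).

    `smoothing` is an assumed pseudo-count: rare levels are pulled
    towards the global target mean (`prior`), frequent levels are not.
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    mapping = {
        level: (sums[level] + smoothing * prior) / (counts[level] + smoothing)
        for level in counts
    }
    return mapping, prior


def transform(levels, mapping, prior):
    """Encode levels; levels unseen during fitting fall back to the prior."""
    return [mapping.get(level, prior) for level in levels]


def out_of_fold_encode(levels, targets, n_folds=5, smoothing=10.0):
    """Encode each training row with a mapping fitted on the other folds."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    n = len(levels)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Round-robin fold assignment (an assumption; shuffle real data first).
        held_out = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        if not held_out or not train:
            continue
        mapping, prior = fit_target_encoding(
            [levels[i] for i in train], [targets[i] for i in train], smoothing
        )
        values = transform([levels[i] for i in held_out], mapping, prior)
        for i, value in zip(held_out, values):
            encoded[i] = value
    return encoded
```

In this sketch, `out_of_fold_encode` would be applied to the training rows, while `fit_target_encoding` on the full training set followed by `transform` would be used for new data. The result is a single numeric column per categorical feature regardless of the number of levels, in contrast to one-hot encoding, which adds one indicator column per level.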

research
07/05/2023

A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables

High-cardinality categorical variables are variables for which the numbe...
research
06/01/2020

Sampling Techniques in Bayesian Target Encoding

Target encoding is an effective encoding technique of categorical variab...
research
11/29/2021

PCA-based Category Encoder for Categorical to Numerical Variable Conversion

Increasing the cardinality of categorical variables might decrease the o...
research
04/30/2019

Encoding Categorical Variables with Conjugate Bayesian Models for WeWork Lead Scoring Engine

Applied Data Scientists throughout various industries are commonly faced...
research
02/11/2020

Improved prediction of soil properties with Multi-target Stacked Generalisation on EDXRF spectra

Machine Learning (ML) algorithms have been used for assessing soil quali...
research
07/03/2019

Encoding high-cardinality string categorical variables

Statistical analysis usually requires a vector representation of categor...
research
01/30/2023

Machine Learning with High-Cardinality Categorical Features in Actuarial Applications

High-cardinality categorical features are pervasive in actuarial data (e...
