Similarity encoding for learning with dirty categorical variables

06/04/2018
by   Patricio Cerda, et al.

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately into feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with very high cardinality but also redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction compared with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality variables, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
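The idea of similarity encoding can be sketched as follows. This is a minimal illustration, not the paper's exact method: it assumes a Jaccard similarity over character 3-grams (the paper evaluates several string similarities), and the function names are hypothetical. Each dirty category is replaced by its vector of similarities to a set of reference categories, so near-duplicate spellings get nearby feature vectors instead of unrelated one-hot columns.

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of s, padded with spaces at the ends."""
    s = f" {s} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the 3-gram sets of two strings
    (one concrete choice of string similarity; others are possible)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def similarity_encode(values, prototypes, n=3):
    """Encode each string as its vector of similarities to the
    prototype categories (a generalization of one-hot encoding:
    with exact-match similarity this reduces to one-hot)."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]
```

For example, a dirty job title like "senior mangaer" gets a feature vector close to that of "senior manager", whereas one-hot encoding would treat the two as unrelated categories. Restricting `prototypes` to a subset of categories corresponds to the prototype-based dimensionality reduction mentioned above.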


