Encoding high-cardinality string categorical variables

07/03/2019, by Patricio Cerda et al.

Statistical analysis usually requires a vector representation of categorical variables, obtained for instance with one-hot encoding. This encoding strategy is not practical when the number of different categories grows, as it creates high-dimensional feature vectors. Additionally, the corresponding entries in the raw data are often represented as strings that carry additional information not captured by one-hot encoding. Here, we seek low-dimensional vectorial encodings of string categorical variables with high cardinality. Ideally, these should i) be scalable to a very large number of categories, ii) be interpretable to the end user, and iii) facilitate statistical analysis. We introduce two new encoding approaches for string categories: a Gamma-Poisson matrix factorization on character-level substring counts, and a min-hash encoder, based on min-wise random permutations for fast approximation of the Jaccard similarity between strings. Both approaches are scalable and are suitable for streaming settings. Extensive experiments on real and simulated data show that these encoding methods improve prediction performance for real-life supervised-learning problems with high-cardinality string categorical variables, and work as well as standard approaches with clean, low-cardinality ones. We recommend the following: i) if scalability is the main concern, the min-hash encoder is the best option, as it does not require any fitting to the data; ii) if interpretability is important, the Gamma-Poisson factorization is a good alternative, as it can be interpreted similarly to one-hot encoding, with each encoding dimension given a feature name summarizing the substrings it captures. Both models remove the need for hand-crafting features and data cleaning of string columns in databases, and can be used for feature engineering in online AutoML settings.
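The min-hash idea can be sketched in a few lines: represent each string by its set of character n-grams, apply a family of hash functions, and keep the minimum hash value per function. Two strings then agree on a component with probability equal to the Jaccard similarity of their n-gram sets. The sketch below is a minimal illustration, not the paper's implementation; seeded CRC32 stands in for random permutations, and all function names are illustrative.

```python
import zlib

def char_ngrams(s, n=3):
    """Set of character n-grams of a string, padded with spaces at the ends."""
    s = f" {s} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_encode(s, dim=64, n=3):
    """Encode a string as `dim` min-hash components over its character n-grams.

    CRC32 seeded with the component index is used here as a cheap stand-in
    for a family of random hash functions (an assumption of this sketch).
    """
    grams = char_ngrams(s, n)
    return [min(zlib.crc32(g.encode(), seed) for g in grams)
            for seed in range(dim)]

def minhash_similarity(u, v):
    """Fraction of matching components: estimates the Jaccard similarity
    of the two strings' n-gram sets."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

# Strings sharing many substrings get similar encodings, without any
# fitting step, which is what makes the encoder suitable for streaming.
u = minhash_encode("Police Officer II", dim=128)
v = minhash_encode("Police Officer III", dim=128)
w = minhash_encode("Senior Accountant", dim=128)
```

Because the encoder is stateless, new categories can be encoded on the fly with no refit, which is the scalability property the abstract highlights.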


Related research

- Similarity encoding for learning with dirty categorical variables (06/04/2018)
- Machine Learning with High-Cardinality Categorical Features in Actuarial Applications (01/30/2023)
- Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features (04/01/2021)
- Sufficient Representations for Categorical Variables (08/26/2019)
- Progressive Feature Upgrade in Semi-supervised Learning on Tabular Domain (12/01/2022)
- Contrastive String Representation Learning using Synthetic Data (10/08/2021)
- Binarsity: a penalization for one-hot encoded features (03/24/2017)
