Evaluating categorical encoding methods on a real credit card fraud detection database

Handling categorical data correctly in a supervised learning context remains a major issue. Although some machine learning methods include built-in mechanisms for categorical features, it is unclear whether these bring any improvement and how they compare with the usual categorical encoding methods. In this paper, we describe several well-known categorical encoding methods based on target statistics and weight of evidence. We apply them to a large, real credit card fraud detection database. We then train state-of-the-art gradient boosting methods on the encoded databases and evaluate their performance. We show that categorical encoding methods generally bring substantial improvements over the absence of encoding. The contribution of this work is twofold: (1) we compare many state-of-the-art "lite" categorical encoding methods on a large-scale database and (2) we use a real credit card fraud detection database.
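
To make the two families of encoders named above concrete, here is a minimal sketch in plain Python of a smoothed target-statistic (target mean) encoding and a weight-of-evidence encoding for a binary fraud label. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `smoothing` and `eps` parameters, the sign convention for weight of evidence (log of fraud share over genuine share) and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def _count_by_category(categories, targets):
    """Return (rows per category, positive rows per category)."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for c, y in zip(categories, targets):
        counts[c] += 1
        positives[c] += y
    return counts, positives


def target_mean_encoding(categories, targets, smoothing=1.0):
    """Map each category to a smoothed mean of the binary target.

    `smoothing` (an assumed parameter) pulls rare categories toward
    the global fraud rate.
    """
    prior = sum(targets) / len(targets)
    counts, positives = _count_by_category(categories, targets)
    return {
        c: (positives[c] + smoothing * prior) / (counts[c] + smoothing)
        for c in counts
    }


def weight_of_evidence(categories, targets, eps=0.5):
    """Map each category to log(P(category | fraud) / P(category | genuine)).

    `eps` (an assumed parameter) is additive smoothing that avoids log(0)
    when a category has no fraud or no genuine rows.
    """
    counts, positives = _count_by_category(categories, targets)
    total_pos = sum(targets)
    total_neg = len(targets) - total_pos
    k = len(counts)
    woe = {}
    for c in counts:
        pos = positives[c]
        neg = counts[c] - pos
        p_pos = (pos + eps) / (total_pos + eps * k)
        p_neg = (neg + eps) / (total_neg + eps * k)
        woe[c] = math.log(p_pos / p_neg)
    return woe


# Toy data (hypothetical): a merchant category and a fraud label per transaction.
merchant = ["a", "a", "b", "b", "b", "c"]
is_fraud = [1, 0, 0, 0, 0, 1]

te = target_mean_encoding(merchant, is_fraud)
woe = weight_of_evidence(merchant, is_fraud)
```

Both mappings would normally be fitted on training data only and then applied to held-out data, since encoding a row with statistics that include its own label leaks the target.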


