RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification

05/09/2021
by Michał Koziarski, et al.

Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and have asymmetric misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods either ignore the local joint distribution of the data or correct for it only as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data space for synthetic oversampling. The category of sub-region used for oversampling can be specified as an input parameter to meet domain-specific needs or be selected automatically via cross-validation. Our 5×2 cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.
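
The abstract does not define class potential, so the following is only a minimal sketch of one common radial-based formulation: the potential of a class at a point is taken to be the sum of Gaussian radial basis functions centred on that class's training samples. The function name `class_potential`, the spread parameter `gamma`, and the toy data are assumptions made for illustration and are not taken from the paper's implementation.

```python
import numpy as np


def class_potential(point, class_samples, gamma=1.0):
    """Sum of Gaussian RBFs centred on each sample of one class.

    Assumed form (not confirmed by the abstract):
        potential(x) = sum_i exp(-(||x_i - x|| / gamma) ** 2)
    A larger value means `point` lies in a region densely populated
    by the samples in `class_samples`.
    """
    point = np.asarray(point, dtype=float)
    class_samples = np.asarray(class_samples, dtype=float)
    # Euclidean distance from the query point to every class sample.
    distances = np.linalg.norm(class_samples - point, axis=1)
    return float(np.sum(np.exp(-(distances / gamma) ** 2)))


# Toy minority-class samples (hypothetical data).
minority = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])

near = class_potential([0.1, 0.1], minority, gamma=0.5)
far = class_potential([5.0, 5.0], minority, gamma=0.5)
print(near > far)
```

Under this assumption, candidate locations for synthetic samples could be ranked by their potential, which is one way to read the abstract's idea of locating sub-regions of the data space for oversampling.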


