A Study of Data Pre-processing Techniques for Imbalanced Biomedical Data Classification

by Shigang Liu, et al.

Biomedical data are widely used to develop prediction models for identifying specific tumors, drug discovery, and classification of human cancers. However, previous studies have usually focused on different classifiers and overlooked the class imbalance problem in real-world biomedical datasets. There is a lack of studies evaluating data pre-processing techniques, such as resampling and feature selection, for learning from imbalanced biomedical data. The relationship between data pre-processing techniques and data distributions has not been analysed in previous studies. This article reviews and evaluates some popular and recently developed resampling and feature selection methods for class imbalance learning. We analyse the effectiveness of each technique from the perspective of the data distribution. Extensive experiments were conducted with five classifiers, four performance measures, and eight learning techniques across twenty real-world datasets. Experimental results show that: (1) resampling and feature selection techniques perform better with the support vector machine (SVM) classifier, but perform poorly with the C4.5 decision tree and linear discriminant analysis classifiers; (2) for datasets with different distributions, techniques such as random undersampling and feature selection perform better than other data pre-processing methods on data following a t location-scale distribution when using the SVM and KNN (K-nearest neighbours) classifiers, while random oversampling outperforms other methods on data following a negative binomial distribution when using the random forest classifier at lower imbalance ratios; (3) feature selection outperforms other data pre-processing methods in most cases, so feature selection with the SVM classifier is the best choice for imbalanced biomedical data learning.
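The random undersampling and random oversampling named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `imbalance_ratio` and `random_resample` are hypothetical, and the imbalance ratio is assumed here to be the majority-class count divided by the minority-class count.

```python
import random
from collections import Counter, defaultdict


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_resample(samples, labels, mode="under", seed=0):
    """Balance classes by random undersampling or random oversampling.

    mode="under": randomly drop samples until every class matches the smallest class.
    mode="over":  randomly duplicate samples until every class matches the largest class.
    """
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)

    # Group samples by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    sizes = [len(xs) for xs in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) > target:
            # Undersampling: keep a random subset without replacement.
            chosen = rng.sample(xs, target)
        elif len(xs) < target:
            # Oversampling: keep all originals, add random duplicates.
            chosen = xs + rng.choices(xs, k=target - len(xs))
        else:
            chosen = list(xs)
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data: 10 majority samples (class 0) and 2 minority samples (class 1).
toy_samples = list(range(12))
toy_labels = [0] * 10 + [1] * 2
under_x, under_y = random_resample(toy_samples, toy_labels, mode="under")
over_x, over_y = random_resample(toy_samples, toy_labels, mode="over")
```

Undersampling discards majority-class information while oversampling repeats minority samples, which is one reason the two can behave differently across classifiers and data distributions.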


An Empirical Study on the Joint Impact of Feature Selection and Data Resampling on Imbalance Classification

Real-world datasets often present different degrees of imbalanced (i.e.,...

Learning Classifiers for Imbalanced and Overlapping Data

This study is about inducing classifiers using data that is imbalanced, ...

Machine Learning on Biomedical Images: Interactive Learning, Transfer Learning, Class Imbalance, and Beyond

In this paper, we highlight three issues that limit performance of machi...

A Characterization of the Combined Effects of Overlap and Imbalance on the SVM Classifier

In this paper we demonstrate that two common problems in Machine Learnin...

A Comprehensive Pipeline for Hotel Recommendation System

This paper addresses a comprehensive pipeline to build a hotel recommend...

Revisiting the Application of Feature Selection Methods to Speech Imagery BCI Datasets

Brain-computer interface (BCI) aims to establish and improve human and c...

Using Kernel Methods and Model Selection for Prediction of Preterm Birth

We describe an application of machine learning to the problem of predict...
