Integrating K-means with Quadratic Programming Feature Selection

by Yamuna Prasad et al.

Several data mining problems are characterized by high-dimensional data. One of the popular ways to reduce the dimensionality of the data is to perform feature selection, i.e., to select a subset of relevant and non-redundant features. Recently, Quadratic Programming Feature Selection (QPFS) has been proposed, which formulates the feature selection problem as a quadratic program. It has been shown to outperform many of the existing feature selection methods for a variety of applications. Though better than many existing approaches, the running time of QPFS is cubic in the number of features, which can be quite computationally expensive even for moderately sized datasets. In this paper we propose a novel method for feature selection by integrating k-means clustering with QPFS. The basic variant of our approach runs k-means to bring down the number of features which need to be passed on to QPFS. We then enhance this idea, gradually refining the feature space from a very coarse clustering to a fine-grained one by interleaving steps of QPFS with steps of k-means clustering. Every step of QPFS helps in identifying clusters of irrelevant features (which can then be thrown away), whereas every step of k-means further refines the clusters which are potentially relevant. We show that our iterative refinement of clusters is guaranteed to converge. We provide bounds on the number of distance computations involved in the k-means algorithm. Further, each QPFS run is now cubic in the number of clusters, which can be much smaller than the actual number of features. Experiments on eight publicly available datasets show that our approach gives significant computational gains (both in time and memory) over standard QPFS as well as other state-of-the-art feature selection methods, even while improving overall accuracy.
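The basic variant described above can be sketched in a few lines: cluster the features with k-means (treating each feature as a point in sample space), keep one representative per cluster, and then solve the QPFS quadratic program only over those representatives. The sketch below is a hedged illustration, not the authors' implementation; the toy data, the choice of absolute correlation for the redundancy matrix `Q` and the relevance vector `f`, the balance parameter `alpha`, and the use of a general-purpose SLSQP solver in place of a dedicated QP solver are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy data (assumption for illustration): 40 features, two informative.
X = rng.standard_normal((200, 40))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)

# Step 1: k-means over features -- each feature is a point in sample space.
k = 8
centroids, labels = kmeans2(X.T, k, minit='++', seed=0)

# Keep one representative feature per cluster: the member closest to the centroid.
reps = []
for c in range(k):
    members = np.where(labels == c)[0]
    if members.size == 0:          # kmeans2 may leave a cluster empty
        continue
    dists = np.linalg.norm(X.T[members] - centroids[c], axis=1)
    reps.append(members[np.argmin(dists)])
reps = np.array(reps)

# Step 2: QPFS on the representatives only.
# Q encodes pairwise redundancy, f encodes relevance to the target
# (absolute correlations here, as a simple stand-in for mutual information).
Xr = X[:, reps]
Q = np.abs(np.corrcoef(Xr, rowvar=False))
f = np.abs(np.array([np.corrcoef(Xr[:, j], y)[0, 1] for j in range(len(reps))]))
alpha = 0.5  # trade-off between redundancy and relevance (assumed value)

def qpfs_objective(x):
    # minimize (1 - alpha) * 1/2 x^T Q x - alpha * f^T x
    return (1 - alpha) * 0.5 * x @ Q @ x - alpha * f @ x

n = len(reps)
res = minimize(qpfs_objective, np.full(n, 1.0 / n), method='SLSQP',
               bounds=[(0.0, 1.0)] * n,
               constraints={'type': 'eq', 'fun': lambda x: x.sum() - 1.0})
weights = res.x  # one importance weight per representative feature
```

The weights lie on the simplex (non-negative, summing to one), so features can be ranked or thresholded by weight; the cubic cost of solving the QP now depends on the number of clusters `k` rather than on the full feature count.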


Related papers:

- Review of Swarm Intelligence-based Feature Selection Methods
- Relevant based structure learning for feature selection
- Discovering Conditionally Salient Features with Statistical Guarantees
- Parameterized Complexity of Feature Selection for Categorical Data Clustering
- Relief-Based Feature Selection: Introduction and Review
- Distributed ReliefF based Feature Selection in Spark
- Generating Redundant Features with Unsupervised Multi-Tree Genetic Programming
