Tandem clustering with invariant coordinate selection

by Andreas Alfons, et al.

For high-dimensional data or data with noise variables, tandem clustering is a well-known technique that aims to improve cluster identification by first reducing the dimension. However, the usual approach using principal component analysis (PCA) has been criticized for focusing only on inertia, so that the first components do not necessarily retain the structure of interest for clustering. To overcome this drawback, we propose a new tandem clustering approach based on invariant coordinate selection (ICS). By jointly diagonalizing two scatter matrices, ICS is designed to find structure in the data while returning affine invariant components. Existing theoretical results guarantee that, under some elliptical mixture models, the structure of the data can be highlighted on a subset of the first and/or last components. Nevertheless, ICS has received little attention in a clustering context. Two challenges are the choice of the pair of scatter matrices and the selection of the components to retain. For clustering purposes, we demonstrate that the best scatter pairs consist of one scatter matrix that captures the within-cluster structure and another that captures the global structure. For the former, local shape or pairwise scatters are of great interest, as is the minimum covariance determinant (MCD) estimator based on a carefully selected subset size that is smaller than usual. We evaluate the performance of ICS as a dimension reduction method in terms of preserving the cluster structure present in the data. In an extensive simulation study and in empirical applications with benchmark data sets, we compare different combinations of scatter matrices and component selection criteria, and we assess the impact of outliers. Overall, the new approach of tandem clustering with ICS shows promising results and clearly outperforms the approach with PCA.
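To make the joint diagonalization step concrete, the following is a minimal sketch in Python using only NumPy. It is an illustration under stated assumptions, not the authors' implementation: the scatter pair used here is the sample covariance together with the fourth-moment scatter COV4 (a classical ICS pair, chosen for brevity rather than one of the pairs the abstract recommends), and the function names `cov4` and `ics` as well as the toy data are hypothetical.

```python
import numpy as np

def cov4(X):
    """Fourth-moment scatter (COV4): covariance weighted by squared Mahalanobis distances."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)  # squared Mahalanobis distances
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))

def ics(X, S1, S2):
    """Jointly diagonalize S1 and S2: find B with B' S1 B = I and B' S2 B = diag(rho)."""
    w, V = np.linalg.eigh(S1)
    S1_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T    # symmetric inverse square root of S1
    M = S1_inv_sqrt @ S2 @ S1_inv_sqrt            # S2 expressed in S1-whitened coordinates
    rho, U = np.linalg.eigh((M + M.T) / 2)        # symmetrize to absorb rounding error
    order = np.argsort(rho)[::-1]                 # sort generalized eigenvalues descending
    rho, U = rho[order], U[:, order]
    B = S1_inv_sqrt @ U
    Z = (X - X.mean(axis=0)) @ B                  # invariant coordinates
    return rho, Z, B

# Toy data (assumption): two Gaussian clusters separated along the first of
# five variables; the remaining four variables are pure noise.
rng = np.random.default_rng(0)
n, p = 300, 5
X = rng.normal(size=(n, p))
X[: n // 5, 0] += 6.0                             # unbalanced mixture: 20% shifted

S1 = np.cov(X, rowvar=False)
S2 = cov4(X)
rho, Z, B = ics(X, S1, S2)
```

In a tandem approach, a subset of the columns of `Z` (some of the first and/or last components, chosen by a component selection criterion) would then be passed to a clustering method in place of the original variables.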

