Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

10/12/2018
by Erich Schubert, et al.

Clustering non-Euclidean data is difficult, and besides hierarchical clustering, one of the most widely used algorithms is PAM (Partitioning Around Medoids), also known as k-medoids. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead: the object with the smallest total dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and is thus highly relevant to domains such as biology that require Jaccard, Gower, or even more complex distances. A key issue with PAM, however, is its high run time cost. In this paper, we propose modifications to the PAM algorithm that, at the cost of storing O(k) additional values, achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm while still finding the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. We also show how the CLARA and CLARANS algorithms benefit from this modification. In experiments on real data with k=100, we observed a 200-fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.
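To make the notions of a medoid and a SWAP step concrete, the following is a minimal Python sketch. It is not the accelerated method proposed in the paper, nor the original PAM implementation: it is a deliberately naive variant that recomputes the full cost for every candidate swap, which is exactly the kind of redundant work a faster SWAP phase would avoid. The function names (`medoid`, `total_deviation`, `naive_pam_swap`), the toy data, and the arbitrary initial medoids (used in place of PAM's first phase) are assumptions made for illustration.

```python
def medoid(dist, members):
    """Return the member with the smallest total dissimilarity to the others."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


def total_deviation(dist, medoids):
    """Sum, over all objects, of the dissimilarity to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def naive_pam_swap(dist, medoids):
    """Apply the best (medoid, non-medoid) swap until no swap lowers the cost.

    Assumption: the cost of each candidate swap is recomputed from scratch,
    which is simple but slow; this is only meant to show what SWAP optimizes.
    """
    medoids = list(medoids)
    cost = total_deviation(dist, medoids)
    while True:
        best = None
        for pos in range(len(medoids)):
            for cand in range(len(dist)):
                if cand in medoids:
                    continue
                # Replace the medoid at position `pos` with the candidate.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_deviation(dist, trial)
                if trial_cost < cost and (best is None or trial_cost < best[0]):
                    best = (trial_cost, trial)
        if best is None:
            return medoids, cost
        cost, medoids = best


# Toy example: 1-D points with absolute difference as the dissimilarity.
# Any precomputed dissimilarity matrix (Jaccard, Gower, ...) could be used instead.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]

# Start from a poor, arbitrary choice of medoids (indices 0 and 1).
medoids, cost = naive_pam_swap(dist, [0, 1])
```

Because the functions only ever read entries of `dist`, the same sketch works for any dissimilarity, which is the property that makes medoid-based clustering attractive for non-Euclidean data.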


