Consistency of Lloyd's Algorithm Under Perturbations

by Dhruv Patel, et al.

Lloyd's algorithm is one of the most widely used clustering algorithms in unsupervised learning. It has inspired a large body of work investigating the correctness of the algorithm in various settings with ground truth clusters. In particular, in 2016, Lu and Zhou showed that the mis-clustering rate of Lloyd's algorithm on n independent samples from a sub-Gaussian mixture is exponentially bounded after O(log(n)) iterations, assuming proper initialization of the algorithm. However, in many applications the true samples are unobserved and must be learned from the data via pre-processing pipelines, such as spectral methods applied to appropriate data matrices. We show that the mis-clustering rate of Lloyd's algorithm on perturbed samples from a sub-Gaussian mixture is also exponentially bounded after O(log(n)) iterations, assuming that the initialization is proper and that the perturbation is small relative to the sub-Gaussian noise. In canonical settings with ground truth clusters, we derive bounds under which algorithms such as k-means++ find good initializations, which in turn yields correctness of the clustering via the main result. We show the implications of these results for pipelines that measure the statistical significance of clusters derived from data, such as SigClust. Finally, we use these general results to derive theoretical guarantees on the mis-clustering rate of Lloyd's algorithm in a host of applications, including high-dimensional time series, multi-dimensional scaling, and community detection for sparse networks via spectral clustering.
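The setting described above can be illustrated with a minimal sketch, which is not the authors' code or experimental setup. It draws samples from a two-component Gaussian mixture (a special case of a sub-Gaussian mixture), adds a small bounded perturbation to each sample, runs Lloyd's algorithm for about log(n) iterations from a reasonable initialization, and reports the mis-clustering rate up to relabeling. The function names (`lloyd`, `misclustering_rate`, `simulate`) and all numeric choices (centers, `sigma`, `eps`, initial centers) are assumptions made for illustration only.

```python
import math
import random
from itertools import permutations


def lloyd(points, centers, iterations):
    """Run Lloyd iterations: assign each point to its nearest center,
    then replace each center by the mean of its assigned points."""
    k = len(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest current center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: coordinate-wise mean of each cluster
        # (an empty cluster keeps its previous center).
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])
        centers = new_centers
    return labels, centers


def misclustering_rate(labels, truth, k):
    """Fraction of mislabeled points, minimized over label permutations."""
    n = len(truth)
    return min(
        sum(perm[lab] != t for lab, t in zip(labels, truth))
        for perm in permutations(range(k))
    ) / n


def simulate(n=400, sigma=1.0, eps=0.1, seed=0):
    """Hypothetical experiment: Gaussian mixture plus a small perturbation."""
    rng = random.Random(seed)
    true_centers = [(-3.0, 0.0), (3.0, 0.0)]  # assumed, well separated
    truth = [rng.randrange(2) for _ in range(n)]
    clean = [tuple(c + rng.gauss(0.0, sigma) for c in true_centers[z])
             for z in truth]
    # Perturbation bounded by eps, small relative to the noise level sigma.
    perturbed = [tuple(x + rng.uniform(-eps, eps) for x in p) for p in clean]
    init = [(-1.0, 0.5), (1.0, -0.5)]  # assumed "proper" initialization
    iterations = math.ceil(math.log(n))  # on the order of log(n)
    labels, _ = lloyd(perturbed, init, iterations)
    return misclustering_rate(labels, truth, 2)


if __name__ == "__main__":
    print("mis-clustering rate:", simulate())
```

In this sketch the perturbation bound `eps` is deliberately much smaller than `sigma`, mirroring the assumption that the perturbation is small relative to the noise. Increasing `eps` or moving the centers closer together is a simple way to watch the rate degrade.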


