Regularization and Global Optimization in Model-Based Clustering

by Raphael Araujo Sampaio, et al.

Due to their conceptual simplicity, k-means algorithm variants have been extensively used for unsupervised cluster analysis. However, one main shortcoming of these algorithms is that they essentially fit a mixture of identical spherical Gaussians to data that may deviate vastly from such a distribution. In comparison, general Gaussian Mixture Models (GMMs) can fit richer structures but require estimating a number of parameters per cluster that is quadratic in the data dimension to represent the covariance matrices. This poses two main issues: (i) the underlying optimization problems are challenging due to their larger number of local minima, and (ii) their solutions can overfit the data. In this work, we design search strategies that circumvent both issues. We develop efficient global optimization algorithms for general GMMs, and we combine these algorithms with regularization strategies that avoid overfitting. Through extensive computational analyses, we observe that global optimization or regularization in isolation does not substantially improve cluster recovery. However, combining these techniques yields a level of performance previously unachieved by k-means algorithm variants, unraveling vastly different cluster structures. These results shed new light on the current status quo between GMM and k-means methods and suggest the more frequent use of general GMMs for data exploration. To facilitate such applications, we provide open-source code as well as Julia packages ("UnsupervisedClustering.jl" and "RegularizedCovarianceMatrices.jl") implementing the proposed techniques.
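
To make the parameter-count argument concrete, the following is a minimal Python sketch, not the authors' implementation and not the API of the Julia packages named above. The function names `covariance_parameters` and `shrink_covariance` are hypothetical. The shrinkage step is a generic regularizer that pulls a covariance matrix toward a scaled identity; it is assumed here purely for illustration, and the regularization strategies used in the paper may differ.

```python
def covariance_parameters(d: int, kind: str) -> int:
    """Free covariance parameters per cluster for data of dimension d.

    'spherical' uses one variance, 'diagonal' uses d variances, and
    'full' uses a symmetric d x d matrix, i.e. d * (d + 1) / 2 entries.
    """
    if kind == "spherical":
        return 1
    if kind == "diagonal":
        return d
    if kind == "full":
        return d * (d + 1) // 2
    raise ValueError(f"unknown covariance kind: {kind}")


def shrink_covariance(cov, alpha):
    """Blend cov with a scaled identity (illustrative assumption).

    Returns (1 - alpha) * cov + alpha * (trace(cov) / d) * I, where
    alpha in [0, 1] controls the strength of the regularization.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    d = len(cov)
    mean_variance = sum(cov[i][i] for i in range(d)) / d
    return [
        [
            (1 - alpha) * cov[i][j] + (alpha * mean_variance if i == j else 0.0)
            for j in range(d)
        ]
        for i in range(d)
    ]


if __name__ == "__main__":
    for d in (2, 10, 50):
        print(d, covariance_parameters(d, "spherical"),
              covariance_parameters(d, "full"))
    print(shrink_covariance([[4.0, 2.0], [2.0, 2.0]], 0.5))
```

With `alpha = 0` the matrix is returned unchanged, while `alpha = 1` collapses it to a spherical covariance, which is the shape that k-means implicitly assumes. Intermediate values trade flexibility against the risk of overfitting a full covariance matrix from few points.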




Cutoff for exact recovery of Gaussian mixture models

We determine the cutoff value on separation of cluster centers for exact...

Optimal Clustering in Anisotropic Gaussian Mixture Models

We study the clustering task under anisotropic Gaussian Mixture Models w...

Addressing overfitting in spectral clustering via a non-parametric bootstrap

Finite mixture modelling is a popular method in the field of clustering ...

Surrogate modeling approximation using a mixture of experts based on EM joint estimation

An automatic method to combine several local surrogate models is present...

Splitting Methods for Convex Clustering

Clustering is a fundamental problem in many scientific applications. Sta...

Towards the global vision of engagement of Generation Z at the workplace: Mathematical modeling

Correlation and cluster analyses (k-Means, Gaussian Mixture Models) were...

When Do Birds of a Feather Flock Together? K-Means, Proximity, and Conic Programming

Given a set of data, one central goal is to group them into clusters bas...
