Gaussian Mixture Clustering Using Relative Tests of Fit

10/07/2019
by Purvasha Chakravarti, et al.

We consider clustering based on significance tests for Gaussian Mixture Models (GMMs). Our starting point is the SigClust method developed by Liu et al. (2008), which introduces a test based on the k-means objective (with k = 2) to decide whether the data should be split into two clusters. When applied recursively, this test yields a hierarchical clustering method equipped with a significance guarantee. We study the limiting distribution and power of this approach in some examples and show that there are large regions of the parameter space where the power is low. We then introduce a new test based on the idea of relative fit. Unlike prior work, we test whether a mixture of Gaussians provides a better fit than a single Gaussian, without assuming that either model is correct. The proposed test has a simple critical value and provides provable error control. One version of our test provides exact, finite-sample control of the type I error. We show how our tests can be used for hierarchical clustering as well as sequentially for model selection. We conclude with an extensive simulation study and a cluster analysis of a gene expression dataset.
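To make the idea of relative fit concrete, here is a minimal Python sketch of one way such a test could look. The abstract does not spell out the construction, so the details are assumptions made for illustration only: the data are one-dimensional, the sample is split in half, both models are fitted on the first half, and the held-out log-likelihood differences are compared with a one-sided normal approximation. The function names (`fit_two_component_gmm`, `relative_fit_test`) and the simple EM routine are hypothetical and are not taken from the paper.

```python
import math
import numpy as np


def gaussian_logpdf(x, mean, var):
    """Log-density of N(mean, var); broadcasts over NumPy arrays."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)


def fit_two_component_gmm(x, n_iter=200):
    """Plain EM for a 1-D two-component Gaussian mixture (illustrative only)."""
    med = np.median(x)
    # Assumption: initialise the two means from the lower and upper halves.
    means = np.array([x[x <= med].mean(), x[x > med].mean()])
    variances = np.array([x.var(), x.var()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibilities, computed stably in log space.
        log_r = np.log(weights) + gaussian_logpdf(x[:, None], means, variances)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted updates, with small floors to avoid degeneracy.
        nk = np.maximum(r.sum(axis=0), 1e-12)
        weights = nk / nk.sum()
        means = (r * x[:, None]).sum(axis=0) / nk
        variances = (r * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances


def mixture_logpdf(x, weights, means, variances):
    """Log-density of the fitted mixture via a manual log-sum-exp."""
    comp = np.log(weights) + gaussian_logpdf(x[:, None], means, variances)
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()


def relative_fit_test(x, alpha=0.05, seed=0):
    """Split-sample test of 'mixture fits no better than one Gaussian'.

    Assumed construction: fit both models on one half, then apply a
    one-sided z-test to the held-out log-likelihood differences.
    """
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x, dtype=float))
    train, test = x[: len(x) // 2], x[len(x) // 2:]

    mu, var = train.mean(), train.var()
    weights, means, variances = fit_two_component_gmm(train)

    # Positive differences favour the mixture on unseen data.
    d = mixture_logpdf(test, weights, means, variances) - gaussian_logpdf(test, mu, var)
    z = float(math.sqrt(len(d)) * d.mean() / d.std(ddof=1))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value, p_value < alpha


# Usage: two well-separated groups, where a split is expected to be favoured.
rng = np.random.default_rng(1)
sample = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
z_stat, p_val, split = relative_fit_test(sample)
print(f"z = {z_stat:.2f}, p = {p_val:.3g}, split = {split}")
```

Because the models are fitted on one half and evaluated on the other, the comparison does not require either model to be correct; it only asks which one predicts held-out data better. Applied recursively to each resulting cluster, a rule of this kind would give a hierarchical procedure in the spirit described above.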
