Flexible Models for Microclustering with Application to Entity Resolution

10/31/2016
by Giacomo Zanella, et al.

Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some applications, this assumption is inappropriate. For example, when performing entity resolution, the size of each cluster should be unrelated to the size of the data set, and each cluster should contain a negligible fraction of the total number of data points. These applications require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new class of models that can exhibit this property. We compare models within this class to two commonly used clustering models using four entity-resolution data sets.
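
The linear-growth behavior described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it samples partitions from a Chinese restaurant process, the partition distribution underlying Dirichlet process mixture models, and reports the share of data points in the largest cluster. The function name `crp_cluster_sizes`, the concentration value `alpha=1.0`, and the sample sizes are illustrative assumptions. Under this process the largest share should stay roughly constant as the data set grows, whereas a model with the microclustering property would need that share to shrink toward zero.

```python
import random


def crp_cluster_sizes(n, alpha, seed=0):
    """Sample cluster sizes for n points from a Chinese restaurant process.

    Hypothetical helper for illustration; alpha is the concentration parameter.
    """
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        # Point i starts a new cluster with probability alpha / (i + alpha);
        # otherwise it joins an existing cluster with probability
        # proportional to that cluster's current size.
        u = rng.random() * (i + alpha)
        if u >= i:
            sizes.append(1)
        else:
            acc = 0
            for k, s in enumerate(sizes):
                acc += s
                if u < acc:
                    sizes[k] += 1
                    break
    return sizes


if __name__ == "__main__":
    # Share of points in the largest cluster for growing data set sizes.
    for n in (100, 1000, 10000):
        sizes = crp_cluster_sizes(n, alpha=1.0, seed=1)
        print(n, len(sizes), max(sizes) / n)
```

A single seed gives only one draw per data set size, so averaging the largest share over several seeds gives a steadier picture of the trend.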


research
12/02/2015

Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set

Most generative models for clustering implicitly assume that the number ...
research
04/04/2020

Random Partition Models for Microclustering Tasks

Traditional Bayesian random partition models assume that the size of eac...
research
03/14/2017

A Random Finite Set Model for Data Clustering

The goal of data clustering is to partition data points into groups to m...
research
09/06/2022

Fast Generation of Exchangeable Sequence of Clusters Data

Recent advances in Bayesian models for random partitions have led to the...
research
10/05/2017

Reliable Learning of Bernoulli Mixture Models

In this paper, we have derived a set of sufficient conditions for reliab...
research
07/15/2020

Mixture Complexity and Its Application to Gradual Clustering Change Detection

In model-based clustering using finite mixture models, it is a significa...
research
09/23/2016

Fast Learning of Clusters and Topics via Sparse Posteriors

Mixture models and topic models generate each observation from a single ...
