Scalable and Distributed Clustering via Lightweight Coresets

02/27/2017
by   Olivier Bachem, et al.

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-Means clustering, Bregman clustering, and maximum likelihood estimation of Gaussian mixture models. The algorithm is substantially faster than existing constructions and embarrassingly parallel, and the resulting coresets are smaller. In an extensive experimental evaluation, we demonstrate that the proposed method outperforms existing coreset constructions.
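The abstract does not spell out the sampling scheme, so the following is only a minimal sketch of the k-Means construction as we understand it from the full paper: each point is drawn with a probability that mixes a uniform term with a term proportional to its squared distance to the data mean, and each sampled point is weighted by the inverse of its sampling probability. The function name `lightweight_coreset`, its parameters, and the fallback for identical points are assumptions made for illustration.

```python
import random


def lightweight_coreset(points, m, seed=None):
    """Sample m weighted points from `points` (a list of equal-length tuples).

    Illustrative sketch only; returns a list of (point, weight) pairs.
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])

    # First pass over the data: mean of the full data set.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]

    # Second pass: squared Euclidean distance of every point to the mean.
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dist)

    # Sampling distribution: half uniform, half proportional to squared distance.
    if total > 0:
        q = [0.5 / n + 0.5 * d / total for d in sq_dist]
    else:
        # Assumption: if all points coincide, fall back to uniform sampling.
        q = [1.0 / n] * n

    # Draw m indices i.i.d. from q and weight each sample by 1 / (m * q).
    idx = rng.choices(range(n), weights=q, k=m)
    return [(points[i], 1.0 / (m * q[i])) for i in idx]


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
    for point, weight in lightweight_coreset(data, m=3, seed=0):
        print(point, round(weight, 3))
```

The sketch needs only two passes over the data (one for the mean, one for the distances), and both passes reduce to sums, which is consistent with the abstract's description of the method as embarrassingly parallel. The weighted sample can then be handed to any k-Means solver that accepts per-point weights.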

research
08/21/2015

Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Coresets are efficient representations of data sets such that models tra...
research
03/19/2017

Practical Coreset Constructions for Machine Learning

We investigate coresets - succinct, small summaries of large data sets -...
research
11/25/2015

Maximum Likelihood Estimation for Single Linkage Hierarchical Clustering

We derive a statistical model for estimation of a dendrogram from single...
research
03/21/2021

Detecting Label Noise via Leave-One-Out Cross-Validation

We present a simple algorithm for identifying and correcting real-valued...
research
10/31/2017

Nebula: F0 Estimation and Voicing Detection by Modeling the Statistical Properties of Feature Extractors

A F0 and voicing status estimation algorithm for speech analysis/synthes...
research
05/02/2016

Linear-time Outlier Detection via Sensitivity

Outliers are ubiquitous in modern data sets. Distance-based techniques a...
research
05/02/2016

Tradeoffs for Space, Time, Data and Risk in Unsupervised Learning

Faced with massive data, is it possible to trade off (statistical) risk,...