Coresets for Clustering in Euclidean Spaces: Importance Sampling is Nearly Optimal

04/14/2020
by Lingxiao Huang, et al.

Given a collection of n points in ℝ^d, the goal of the (k,z)-clustering problem is to find a set of k "centers" that minimizes the sum of the z-th powers of the Euclidean distances from each point to its closest center. Special cases include the k-median (z=1) and k-means (z=2) problems. Our main result is a unified two-stage importance sampling framework that constructs an ε-coreset for the (k,z)-clustering problem. Compared to the results for (k,z)-clustering in [Feldman and Langberg, STOC 2011], our framework saves an ε^2 d factor in the coreset size. Compared to the results for (k,z)-clustering in [Sohler and Woodruff, FOCS 2018], our framework saves a poly(k) factor in the coreset size and avoids the exp(k/ε) term in the construction time. Specifically, our coreset for k-median (z=1) has size Õ(ε^-4 k), which saves a k factor in the coreset size compared to the result in [Sohler and Woodruff, FOCS 2018]. Our algorithmic results rely on a new dimensionality reduction technique that connects two well-known shape fitting problems, subspace approximation and clustering, and may be of independent interest. We also provide a size lower bound of Ω(k·min{2^(z/20), d}) for a 0.01-coreset for (k,z)-clustering; its linear dependence on k and exponential dependence on z match our algorithmic results.
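To make the objects in the abstract concrete, here is a minimal Python sketch of the (k,z)-clustering cost and of a generic one-stage importance sampling step. It is not the paper's two-stage framework. The function names (`cost`, `importance_sample`), the sampling scores (a point's share of the cost plus the inverse of its cluster size, relative to some rough center set) and the demo data are all assumptions chosen for illustration. A weighted sample S is an ε-coreset when, for every set C of k centers, its weighted cost lies within a (1±ε) factor of the cost of the full point set; the sketch only shows how such a weighted sample can be drawn and does not certify that guarantee.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cost(points, centers, z, weights=None):
    """(k,z)-clustering cost: weighted sum of z-th powers of the
    distance from each point to its closest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(dist(p, c) for c in centers) ** z
               for p, w in zip(points, weights))


def importance_sample(points, rough_centers, z, m, rng):
    """Draw m points with replacement (illustrative one-stage scheme).

    Assumption: each point's score is its share of the cost with respect
    to rough_centers plus 1 / (size of its cluster). Sampled points get
    weight 1 / (m * probability), so for any fixed center set the
    weighted cost of the sample is an unbiased estimate of the full cost.
    """
    n = len(points)
    k = len(rough_centers)
    # Assign every point to its nearest rough center.
    assign = [min(range(k), key=lambda j: dist(p, rough_centers[j]))
              for p in points]
    sizes = [0] * k
    for a in assign:
        sizes[a] += 1
    contrib = [dist(p, rough_centers[a]) ** z for p, a in zip(points, assign)]
    total = sum(contrib)
    scores = [(c / total if total > 0 else 0.0) + 1.0 / sizes[a]
              for c, a in zip(contrib, assign)]
    norm = sum(scores)
    probs = [s / norm for s in scores]
    idx = rng.choices(range(n), weights=probs, k=m)
    sample = [points[i] for i in idx]
    sample_weights = [1.0 / (m * probs[i]) for i in idx]
    return sample, sample_weights


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical demo data: two Gaussian blobs in the plane.
    true_centers = [(0.0, 0.0), (10.0, 10.0)]
    pts = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
           for cx, cy in true_centers for _ in range(500)]
    S, W = importance_sample(pts, true_centers, z=2, m=100, rng=rng)
    query = [(1.0, 0.0), (9.0, 11.0)]  # an arbitrary candidate center set
    print("full cost   :", cost(pts, query, 2))
    print("sample cost :", cost(S, query, 2, W))
```

Sampling with replacement keeps the estimator simple, at the price of possible duplicate points in the sample. The rough centers would normally come from some approximate clustering; here they are passed in by the caller.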

research
02/25/2022

Towards Optimal Lower Bounds for k-median and k-means Coresets

Given a set of points in a metric space, the (k,z)-clustering problem co...
research
09/05/2022

The Power of Uniform Sampling for Coresets

Motivated by practical generalizations of the classic k-median and k-mea...
research
07/10/2019

Coresets for Clustering in Graphs of Bounded Treewidth

We initiate the study of coresets for clustering in graph metrics, i.e.,...
research
02/28/2019

Probabilistic smallest enclosing ball in high dimensions via subgradient sampling

We study a variant of the median problem for a collection of point sets ...
research
09/09/2018

Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension

We obtain the first strong coresets for the k-median and subspace approx...
research
06/30/2021

Coresets for Clustering with Missing Values

We provide the first coreset for clustering points in ℝ^d that have mult...
research
02/01/2018

Sensitivity Sampling Over Dynamic Geometric Data Streams with Applications to k-Clustering

Sensitivity based sampling is crucial for constructing nearly-optimal co...
