Sampling Sketches for Concave Sublinear Functions of Frequencies

07/04/2019
by Edith Cohen, et al.

We consider massive distributed datasets that consist of elements modeled as key-value pairs, and the task of computing statistics or aggregates in which the contribution of each key is weighted by a function of its frequency (the sum of the values of its elements). This fundamental problem has a wealth of applications in data analytics and machine learning, in particular with concave sublinear functions of the frequencies, which mitigate the disproportionate effect of keys with high frequency. The family of concave sublinear functions includes low frequency moments (p ≤ 1), capping, logarithms, and their compositions. A common approach is to sample keys, ideally with probability proportional to their contributions, and to estimate statistics from the sample. A simple but costly way to do this is to aggregate the data into a table of keys and their frequencies, apply the function to the frequency values, and then apply a weighted sampling scheme. Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies. The size of our sketch structure is very close to the desired sample size, and our samples provide statistical guarantees on estimation quality that are very close to those of an ideal sample of the same size computed over the aggregated data. Finally, we demonstrate experimentally the simplicity and effectiveness of our methods.
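The "simple but costly" baseline described in the abstract (aggregate, apply the function, then sample by weight) can be illustrated with a short sketch. This is not the paper's composable sketch; it is only the aggregation-based reference point that the sketches are meant to approximate. The function names (`aggregate`, `cap`, `weighted_sample`, `baseline_sample`) are hypothetical, and the sampling scheme used here (keep the k keys with the smallest exponential seed divided by weight) is one common choice of weighted sampling without replacement, assumed for illustration.

```python
import math
import random
from collections import defaultdict


def aggregate(elements):
    """Sum the values of all elements that share a key (the key's frequency)."""
    freq = defaultdict(float)
    for key, value in elements:
        freq[key] += value
    return dict(freq)


def cap(threshold):
    """Capping function min(w, threshold), one concave sublinear example."""
    return lambda w: min(w, threshold)


def weighted_sample(weights, k, rng):
    """Sample up to k keys without replacement, favoring larger weights.

    Assumed scheme: each key draws seed = Exp(1) / weight and the k
    smallest seeds are kept. Keys with non-positive weight are skipped.
    """
    seeds = []
    for key, w in weights.items():
        if w > 0:
            seeds.append((rng.expovariate(1.0) / w, key))
    seeds.sort(key=lambda pair: pair[0])
    return [key for _, key in seeds[:k]]


def baseline_sample(elements, f, k, seed=0):
    """Aggregate the elements, apply f to each frequency, then sample by weight."""
    rng = random.Random(seed)
    freq = aggregate(elements)
    weights = {key: f(w) for key, w in freq.items()}
    return weighted_sample(weights, k, rng)


if __name__ == "__main__":
    # Toy data: key "a" is very frequent, the other keys are rare.
    data = [("a", 1.0)] * 1000 + [("b", 1.0)] * 10 + [("c", 2.0), ("d", 1.0)]
    print(baseline_sample(data, math.sqrt, k=2))       # frequency moment p = 1/2
    print(baseline_sample(data, math.log1p, k=2))      # logarithm, log(1 + w)
    print(baseline_sample(data, cap(5.0), k=2))        # capping at 5
```

The cost of this baseline is the frequency table itself, which needs one entry per distinct key; the abstract's contribution is a sketch whose size stays close to the sample size k instead.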

research
10/25/2020

Differentially Private Weighted Sampling

Common datasets have the form of elements with keys (e.g., transactions ...
research
07/14/2020

WOR and p's: Sketches for ℓ_p-Sampling Without Replacement

Weighted sampling is a fundamental tool in data analysis and machine lea...
research
04/09/2020

Composable Sketches for Functions of Frequencies: Beyond the Worst Case

Recently there has been increased interest in using machine learning tec...
research
10/07/2020

New Verification Schemes for Frequency-Based Functions on Data Streams

We study the general problem of computing frequency-based functions, i.e...
research
03/02/2015

Recovering PCA from Hybrid-(ℓ_1,ℓ_2) Sparse Sampling of Data Elements

This paper addresses how well we can recover a data matrix when only giv...
research
05/21/2018

The Adaptive sampling revisited

The problem of estimating the number n of distinct keys of a large colle...
research
06/11/2019

Temporally-Biased Sampling Schemes for Online Model Management

To maintain the accuracy of supervised learning models in the presence o...
