Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation

09/12/2017
by Daniel Ting, et al.

We introduce and study a new data sketch for processing massive data sets. It addresses two common problems: (1) computing a sum given arbitrary filter conditions and (2) identifying the frequent items, or heavy hitters, in a data set. For the former, the sketch provides unbiased estimates with state-of-the-art accuracy. It handles the challenging scenario in which the data is disaggregated, so that computing the per-unit metric of interest requires an expensive aggregation. For example, the metric of interest may be total clicks per user while the raw data is a click stream with multiple rows per user. The sketch is therefore suitable for a wide range of applications, including computing historical click-through rates for ad prediction, reporting user metrics from event streams, and measuring network traffic for IP flows. We prove and empirically show that the sketch has good properties for both the disaggregated subset sum estimation problem and the frequent item problem. On i.i.d. data, it not only picks out the frequent items but also gives strongly consistent estimates of the proportion of each frequent item. The resulting sketch asymptotically draws a probability-proportional-to-size sample that is optimal for estimating sums over the data. For non-i.i.d. data, we show that it typically does much better than random sampling for the frequent item problem and never does worse. For subset sum estimation, we show that even for pathological sequences, the variance is close to that of an optimal sampling design. Empirically, despite the disadvantage of operating on disaggregated data, our method matches or bests priority sampling, a state-of-the-art method for pre-aggregated data, and performs orders of magnitude better than uniform sampling on skewed data. We propose extensions to the sketch that allow it to be used for combining multiple data sets, in distributed systems, and for time-decayed aggregation.
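The abstract does not spell out the sketch's update rule. As a rough illustration of the disaggregated setting it describes, the following is a minimal sketch of a Space-Saving-style counter with randomized label replacement, which is our understanding of the general idea behind the full paper rather than a reproduction of the authors' implementation. The class name, method names, and parameters are hypothetical. Rows arrive one at a time (for example, one click per row), and the sketch keeps at most `capacity` labeled counters.

```python
import random


class UnbiasedSpaceSavingSketch:
    """Illustrative only: hypothetical names, not the authors' code."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity          # maximum number of counters kept
        self.counts = {}                  # item label -> estimated count
        self.rng = random.Random(seed)    # private RNG for reproducibility

    def update(self, item):
        """Process one raw row (e.g. one click by user `item`)."""
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Linear scan for the smallest counter; a real implementation
            # would use a more efficient structure.
            min_item = min(self.counts, key=self.counts.get)
            new_count = self.counts[min_item] + 1
            # Always increment the smallest counter, but hand its label to
            # the new item only with probability 1 / new_count.
            if self.rng.random() < 1.0 / new_count:
                del self.counts[min_item]
                self.counts[item] = new_count
            else:
                self.counts[min_item] = new_count

    def subset_sum(self, predicate):
        """Estimate the total count over items satisfying `predicate`."""
        return sum(c for item, c in self.counts.items() if predicate(item))

    def total(self):
        """Sum of all counters; equals the number of rows processed."""
        return sum(self.counts.values())


if __name__ == "__main__":
    clicks = ["alice"] * 50 + ["bob"] * 30 + ["carol", "dave", "erin"] * 5
    sketch = UnbiasedSpaceSavingSketch(capacity=3, seed=0)
    for user in clicks:
        sketch.update(user)
    print(sketch.counts)
    print(sketch.subset_sum(lambda u: u in {"alice", "bob"}))
```

Because every row increments exactly one counter, the counters always sum to the number of rows seen, and when `capacity` is at least the number of distinct items the counts are exact. The filter passed to `subset_sum` can be chosen after the stream has been processed, which is the "arbitrary filter conditions" use case the abstract mentions.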

research
02/05/2007

On the variance of subset sum estimation

For high volume data streams and large data warehouses, sampling is used...
research
01/06/2022

SQUAD: Combining Sketching and Sampling Is Better than Either for Per-item Quantile Estimation

Stream monitoring is fundamental in many data stream applications, such ...
research
06/21/2023

PrivSketch: A Private Sketch-based Frequency Estimation Protocol for Data Streams

Local differential privacy (LDP) has recently become a popular privacy-p...
research
12/07/2021

SpaceSaving^±: An Optimal Algorithm for Frequency Estimation and Frequent items in the Bounded Deletion Model

In this paper, we propose the first deterministic algorithms to solve th...
research
10/31/2022

Local Differentially Private Frequency Estimation based on Learned Sketches

Sketches are widely used for frequency estimation of data with a large d...
research
05/24/2020

HyperLogLog Sketch Acceleration on FPGA

Data sketches are a set of widely used approximated data summarizing tec...
research
06/11/2022

Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

In data mining, estimating the number of distinct values (NDV) is a fund...