A Random Sample Partition Data Model for Big Data Analysis
Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model, which represents a big data set as a set of non-overlapping data subsets, called RSP data blocks, where each RSP data block has the same probability distribution as the whole big data set. Block-based sampling is then used to directly select representative samples for a variety of data analysis tasks. We show how RSP data blocks can be employed to estimate statistics and build models that are equivalent to, or approximations of, those obtained from the whole big data set.
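The idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not say how RSP data blocks are generated, so the sketch assumes that a global random shuffle followed by an even split is an acceptable way to obtain blocks whose distribution matches the whole data set. The names `make_rsp_blocks` and `block_sample_mean`, the synthetic Gaussian data, and the block counts are all hypothetical choices made for illustration.

```python
import random
import statistics


def make_rsp_blocks(data, num_blocks, seed=0):
    """Split data into non-overlapping blocks after a global random shuffle.

    Assumption: shuffling before splitting makes every block a random
    sample of the whole data set. The paper's own procedure may differ.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    # Striding deals records round-robin, so block sizes differ by at most one.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def block_sample_mean(blocks, num_selected, seed=0):
    """Estimate the mean from a few randomly selected whole blocks."""
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    # Pool the selected blocks so unequal block sizes are weighted correctly.
    pooled = [x for block in selected for x in block]
    return statistics.fmean(pooled)


# Synthetic stand-in for a big data set (hypothetical values).
data_rng = random.Random(42)
data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
estimate = block_sample_mean(blocks, num_selected=5)
full_mean = statistics.fmean(data)

print(f"estimate from 5 of 100 blocks: {estimate:.3f}")
print(f"mean of the whole data set:    {full_mean:.3f}")
```

In this sketch, only 5% of the records are read to produce the estimate, and selecting whole blocks avoids a record-level scan of the full data set. The same pattern would apply to other statistics or to fitting a model per selected block.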