Mining CFD Rules on Big Data

08/05/2018
by Hongzhi Wang, et al.

Existing algorithms for discovering conditional functional dependencies (CFDs) require a well-prepared training data set, which makes them difficult to apply to large datasets, since these are typically of low quality. To handle the volume of big data, we develop sampling algorithms that obtain a small, representative training set. To handle its low quality, we design a fault-tolerant rule discovery algorithm and a conflict resolution algorithm. We also propose a parameter selection strategy for the CFD discovery algorithm to ensure its effectiveness. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable time.
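
As a rough illustration of the two ideas named in the abstract (sampling a small training set, then tolerating some dirty tuples when judging a rule), the sketch below may help. It is not the authors' algorithm: it substitutes plain reservoir sampling for their sampling method and a simple majority-agreement confidence score for their fault-tolerant discovery. The function names, the example attributes (`country`, `zip`, `street`), and the 0.8 threshold are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def reservoir_sample(rows, k, seed=0):
    """Uniformly sample up to k rows from an iterable in one pass.

    Standard reservoir sampling, used here only as a stand-in for the
    paper's sampling algorithms, which the abstract does not specify.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def cfd_confidence(rows, condition, lhs, rhs):
    """Score a candidate CFD (condition: lhs -> rhs) on a set of rows.

    Among rows matching the constant condition, rows are grouped by their
    lhs values; the score is the fraction of rows that agree with the
    majority rhs value of their group. A score below 1.0 means some rows
    violate the rule. This scoring rule is an illustrative assumption.
    """
    groups = defaultdict(Counter)
    for row in rows:
        if all(row.get(attr) == val for attr, val in condition.items()):
            key = tuple(row[attr] for attr in lhs)
            groups[key][row[rhs]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 0.0
    agreeing = sum(max(counts.values()) for counts in groups.values())
    return agreeing / total


# Hypothetical data: for UK tuples, zip should determine street.
rows = [{"country": "UK", "zip": "EH4", "street": "Mayfield"}] * 9
rows += [{"country": "UK", "zip": "EH4", "street": "Crichton"}]  # dirty tuple
rows += [{"country": "US", "zip": "EH4", "street": "Main"}]      # outside the condition

confidence = cfd_confidence(rows, {"country": "UK"}, ["zip"], "street")
accepted = confidence >= 0.8  # assumed tolerance threshold
```

With a tolerance threshold below 1.0, a rule can still be accepted when a minority of tuples contradict it, which is the general intuition behind fault tolerance on low-quality data. How such a threshold should actually be chosen is the kind of question the paper's parameter selection strategy addresses.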


