Detecting Data Errors with Statistical Constraints

02/26/2019
by   Jing Nathan Yan, et al.

A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user's domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs conflict with one another. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches.
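To make the first two tasks concrete, here is a minimal sketch of checking an independence SC between two categorical columns and ranking the records that contribute most to a violation. It is not CODED's algorithm, which this abstract does not describe. It swaps in a Pearson chi-square statistic for the check and a simple leave-one-out ranking for the top-k step. The function names, the `critical_value` parameter, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def chi_square_statistic(pairs):
    """Return (statistic, degrees of freedom) for two categorical columns.

    `pairs` is a list of (value_in_column_A, value_in_column_B) records.
    This is the standard Pearson chi-square statistic, used here as a
    stand-in test; it is an assumption, not the method from the paper.
    """
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    stat = 0.0
    for a in left:
        for b in right:
            # Expected count of (a, b) if the two columns were independent.
            expected = left[a] * right[b] / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(left) - 1) * (len(right) - 1)
    return stat, dof


def violates_independence(pairs, critical_value):
    """An independence SC counts as violated when the statistic exceeds
    a caller-supplied critical value (hypothetical parameter)."""
    stat, _ = chi_square_statistic(pairs)
    return stat > critical_value


def top_k_suspects(pairs, k):
    """Rank record indices by how much the statistic drops when that
    record is left out (a naive leave-one-out heuristic, assumed here)."""
    base, _ = chi_square_statistic(pairs)
    drops = []
    for i in range(len(pairs)):
        rest = pairs[:i] + pairs[i + 1:]
        stat, _ = chi_square_statistic(rest)
        drops.append((base - stat, i))
    # Largest drop first; ties are broken by the lower record index.
    drops.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in drops[:k]]


# Toy data: two columns that are perfectly dependent.
dependent = [("a", "x")] * 10 + [("b", "y")] * 10
# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom.
print(violates_independence(dependent, critical_value=3.841))
```

The leave-one-out ranking recomputes the statistic once per record, so it is quadratic in the number of records and only suitable for small examples. Duplicate records receive identical scores, so it cannot tell which copy of a repeated record is the erroneous one.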


research
05/09/2012

Domain Knowledge Uncertainty and Probabilistic Parameter Constraints

Incorporating domain knowledge into the modeling process is an effective...
research
08/17/2022

Domain Knowledge in A*-Based Causal Discovery

Causal discovery has become a vital tool for scientists and practitioner...
research
04/06/2022

Style-Hallucinated Dual Consistency Learning for Domain Generalized Semantic Segmentation

In this paper, we study the task of synthetic-to-real domain generalized...
research
11/02/2021

MultiplexNet: Towards Fully Satisfied Logical Constraints in Neural Networks

We propose a novel way to incorporate expert knowledge into the training...
research
10/12/2017

Sign-Constrained Regularized Loss Minimization

In practical analysis, domain knowledge about analysis target has often ...
research
03/20/2017

Copula Index for Detecting Dependence and Monotonicity between Stochastic Signals

This paper introduces a nonparametric copula-based approach for detectin...
research
12/20/2019

Data Validation Infrastructure for R

Checking data quality against domain knowledge is a common activity that...
