Discovering Reliable Correlations in Categorical Data

08/30/2019
by Panagiotis Mandros, et al.

In many scientific tasks we are interested in discovering whether our data contain any correlations. This raises several questions: how to measure correlation over a multivariate set of attributes reliably and interpretably, how to do so without making assumptions about the distribution of the data or the type of correlation, and how to efficiently discover the most reliably correlated attribute sets in the data. In this paper we answer these questions for discovery tasks in categorical data. In particular, we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, which yields a reliable, naturally interpretable, non-parametric measure of correlation over multivariate sets. For the discovery of the top-k correlated sets, we derive an effective algorithmic framework based on a tight bounding function. This framework offers exact, approximate, and heuristic search. Empirical evaluation shows that, even for small sample sizes, the estimator leads to low-regret optimization outcomes, while the algorithms are highly effective for both large and high-dimensional data. Through two case studies we confirm that our discovery framework identifies interesting and meaningful correlations.
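
To make the idea of a chance-corrected, normalized total correlation concrete, the following Python sketch may help. It is not the estimator proposed in the paper, whose details the abstract does not give. It assumes a plug-in entropy estimate, normalization by the standard upper bound on total correlation (the sum of marginal entropies minus the largest one), and a Monte Carlo permutation estimate of the score expected by chance. The function names and the `n_perm` and `seed` parameters are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable outcomes."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalized_total_correlation(columns):
    """Plug-in total correlation scaled to [0, 1].

    columns: list of equal-length lists of categorical values.
    Assumption: normalize by sum(H(X_i)) - max(H(X_i)), an upper bound
    on total correlation; the paper may normalize differently.
    """
    marginals = [entropy(col) for col in columns]
    joint = entropy(list(zip(*columns)))
    total = sum(marginals) - joint
    bound = sum(marginals) - max(marginals)
    return total / bound if bound > 0 else 0.0


def corrected_score(columns, n_perm=200, seed=0):
    """Observed score minus its average under random permutations.

    Assumption: the chance term is approximated by shuffling every column
    except the first, which keeps marginals fixed but breaks dependence.
    """
    rng = random.Random(seed)
    observed = normalized_total_correlation(columns)
    chance = 0.0
    for _ in range(n_perm):
        shuffled = [columns[0]] + [rng.sample(col, len(col)) for col in columns[1:]]
        chance += normalized_total_correlation(shuffled)
    return observed - chance / n_perm
```

For two identical binary columns the uncorrected score is 1, and for two columns whose values are perfectly balanced across all combinations it is 0. The correction subtracts the small positive score that finite samples produce even when attributes are independent, which is the kind of bias a corrected-for-chance estimator is meant to remove.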


