Correlation Sketches for Approximate Join-Correlation Queries

by Aécio Santos, et al.

The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column Q and a join column K_Q from a query table 𝒯_Q, retrieve tables 𝒯_X in a dataset collection such that 𝒯_X is joinable with 𝒯_Q on K_Q and there is a column C ∈ 𝒯_X such that Q is correlated with C. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between Q and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we (1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and (2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.
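To make the query definition concrete, the following is a minimal Python sketch, not the paper's implementation. It contrasts the naïve evaluation (full join, then Pearson correlation) with one possible sketch-based estimate. The sketch construction shown here, keeping the rows whose join keys have the smallest hash values, is an assumption made for illustration; the abstract does not specify the method. The function names (`key_hash`, `build_sketch`, `naive_join_correlation`, `sketch_correlation`) are hypothetical, and each table is simplified to a dict that maps a unique join key to one numeric value.

```python
import hashlib
import math


def key_hash(key):
    """Deterministic 64-bit hash of a join key (illustrative choice: SHA-1)."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def joined_correlation(left, right):
    """Join two key -> value dicts on their keys and correlate the values."""
    shared = set(left) & set(right)
    xs = [left[k] for k in shared]
    ys = [right[k] for k in shared]
    return pearson(xs, ys)


def naive_join_correlation(query, candidate):
    """Naive baseline: explicit join over the full tables."""
    return joined_correlation(query, candidate)


def build_sketch(table, size):
    """Assumed sketch: keep the `size` rows whose keys hash to the smallest values."""
    smallest = sorted(table, key=key_hash)[:size]
    return {k: table[k] for k in smallest}


def sketch_correlation(sketch_q, sketch_c):
    """Estimate the join-correlation from two sketches instead of full tables."""
    return joined_correlation(sketch_q, sketch_c)


# Toy data: the tables overlap on keys 500..999, where C = 2 * Q + 1.
query = {i: float(i) for i in range(1000)}
candidate = {i: 2.0 * i + 1.0 for i in range(500, 1500)}

exact = naive_join_correlation(query, candidate)
estimate = sketch_correlation(build_sketch(query, 200), build_sketch(candidate, 200))
```

Because both sketches select keys with the same hash function, a key that survives in one sketch tends to survive in the other, so the sketch join behaves like a sample of the full join. In this toy setup the relationship is exactly linear, so both `exact` and `estimate` are 1 up to floating-point error; on real data the estimate would carry sampling error that depends on the sketch size.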




