Discovering Multi-Table Functional Dependencies Without Full Join Computation

12/11/2020
by Ugo Comignani, et al.

In this paper, we study the problem of discovering join FDs, i.e., functional dependencies (FDs) that hold on multiple joined tables. We leverage logical inference, selective mining, and sampling, and show that we can discover most of the exact join FDs from the single tables participating in the join while avoiding the full computation of the join result. We propose algorithms that speed up the join FD discovery process and mine FDs on the fly only from the necessary data partitions. We introduce JEDI (Join dEpendency DIscovery), our solution for discovering join FDs without computing the full join beforehand. Our experiments on a range of real-world and synthetic data demonstrate the benefits of our method over existing FD discovery methods, which need to precompute the join result before discovering the FDs. We show that the performance depends on the cardinalities and coverage of the join attribute values. For join operations with low coverage, JEDI with selective mining outperforms the competing methods, which use the straightforward approach of full join computation, by one order of magnitude in terms of runtime, and it can discover three-quarters of the exact join FDs using mainly logical inference in half of its total execution time on average. For higher join coverage, JEDI with sampling reaches precision of 1 with only 63 average.
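To make the idea of inferring join FDs from single tables concrete, here is a minimal sketch. It is not the JEDI algorithm; it only illustrates two standard facts: an FD that holds on one table still holds on an inner join involving that table, and FDs chained through the join attribute combine by transitivity. The tables `orders` and `customers`, their attributes, and the helpers `holds` and `inner_join` are all hypothetical names introduced for illustration.

```python
from itertools import product


def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first value seen for a key is stored; any later mismatch violates the FD.
        if seen.setdefault(key, val) != val:
            return False
    return True


def inner_join(left, right, key):
    """Naive equi-join on one shared attribute (the full join we want to avoid)."""
    return [{**l, **r} for l, r in product(left, right) if l[key] == r[key]]


# Hypothetical toy tables (assumption: not taken from the paper).
orders = [
    {"order_id": 1, "cust_id": "c1"},
    {"order_id": 2, "cust_id": "c1"},
    {"order_id": 3, "cust_id": "c2"},
]
customers = [
    {"cust_id": "c1", "city": "Lyon"},
    {"cust_id": "c2", "city": "Paris"},
    {"cust_id": "c3", "city": "Lyon"},  # c3 never appears in orders
]

# Logical inference: order_id -> cust_id (orders) and cust_id -> city (customers)
# imply order_id -> city on the join by transitivity, with no join computed.
inferred = holds(orders, ["order_id"], ["cust_id"]) and holds(
    customers, ["cust_id"], ["city"]
)

# Verification only: materialize the join and check the inferred FD directly.
joined = inner_join(orders, customers, "cust_id")
verified = holds(joined, ["order_id"], ["city"])

# An FD can also hold on the join without holding on any single table:
# city -> cust_id fails on customers but holds once c3 is filtered out by the join.
holds_on_table = holds(customers, ["city"], ["cust_id"])
holds_on_join = holds(joined, ["city"], ["cust_id"])
```

The last two checks hint at why inference from single tables alone recovers most, but not all, join FDs: when some join attribute values have no match (low coverage), the join drops tuples and new FDs can appear, so these have to be found by mining or sampling data rather than by inference.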


