SANTOS: Relationship-based Semantic Table Union Search

by Aamod Khatiwada, et al.

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns. The first uses an existing knowledge base (KB); the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
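To make the idea of relationship-based unionability concrete, the following is a minimal sketch and not the SANTOS algorithm itself. It assumes a toy, hand-written KB (`TOY_KB`) that maps value pairs to a relationship label, labels each column pair by majority vote, and scores a candidate table by the fraction of the query table's relationships it also contains. All names, the voting rule, and the scoring formula are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy KB: (subject value, object value) -> relationship label.
TOY_KB = {
    ("Toronto", "Canada"): "locatedIn",
    ("Paris", "France"): "locatedIn",
    ("Berlin", "Germany"): "locatedIn",
}


def pair_relationship(table, col_a, col_b, kb):
    """Return the majority relationship label for a column pair, or None."""
    votes = Counter()
    for a, b in zip(table[col_a], table[col_b]):
        # Look the value pair up in either direction.
        label = kb.get((a, b), kb.get((b, a)))
        if label is not None:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


def relationships(table, kb):
    """Collect the relationship labels found over all column pairs."""
    found = set()
    for col_a, col_b in combinations(table, 2):
        label = pair_relationship(table, col_a, col_b, kb)
        if label is not None:
            found.add(label)
    return found


def unionability(query, candidate, kb):
    """Fraction of the query's relationships also present in the candidate."""
    query_rels = relationships(query, kb)
    if not query_rels:
        return 0.0
    return len(query_rels & relationships(candidate, kb)) / len(query_rels)


# Tables are represented as dicts from column name to a list of values.
query = {"city": ["Toronto", "Paris"], "country": ["Canada", "France"]}
same_relationship = {"town": ["Berlin"], "nation": ["Germany"]}
same_column_only = {"city": ["Berlin", "Paris"], "river": ["Spree", "Seine"]}

score_related = unionability(query, same_relationship, TOY_KB)
score_unrelated = unionability(query, same_column_only, TOY_KB)
```

In this sketch, the table that shares only a city column with the query scores lower than the table that shares the city-to-country relationship, even though its column names differ. A purely column-based measure could not make that distinction.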




