WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses

by Tianji Cong, et al.

Data discovery is a major challenge in enterprise data analysis: users often struggle to find data relevant to their analysis goals or even to navigate through data across data sources, each of which may easily contain thousands of tables. One common user need is to discover tables joinable with a given table. This need is particularly critical because join is a ubiquitous operation in data analysis, and join paths are mostly obscure to users, especially across databases. Furthermore, users are typically interested in finding “semantically” joinable tables, that is, tables whose columns can be transformed to become joinable even if they are not joinable as currently represented in the data store. We present WarpGate, a system prototype for data discovery over cloud data warehouses. WarpGate implements an embedding-based solution to semantic join discovery, which encodes columns into a high-dimensional vector space such that joinable columns map to points that are near each other. Through experiments on several table corpora, we show that WarpGate (i) captures semantic relationships between tables, especially those across databases, and (ii) is sample efficient and thus scalable to very large tables of millions of rows. We also showcase an application of WarpGate within an enterprise product for cloud data analytics.
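To make the embedding idea concrete, the following is a minimal sketch and not WarpGate's actual implementation. It assumes a toy column encoder built from hashed character trigrams (the abstract does not say which embedding model the system uses), draws a fixed-size sample of values per column to mimic the sample-efficiency claim, and ranks candidate columns by cosine similarity. The names `embed_column`, `rank_joinable`, `DIM`, and `sample_size` are hypothetical and chosen only for illustration.

```python
import math
import random
import zlib

DIM = 256  # assumed embedding size for this toy example


def _bucket(token: str) -> int:
    """Map a token to a stable bucket index via CRC32 (the hashing trick)."""
    return zlib.crc32(token.encode("utf-8")) % DIM


def embed_column(values, sample_size=100, seed=0):
    """Embed a column as a unit-length vector of hashed character-trigram counts.

    Only a random sample of at most `sample_size` values is used, so the cost
    does not grow with the number of rows. Lowercasing and trimming stand in
    for the transformations that make columns "semantically" joinable.
    """
    values = list(values)
    if len(values) > sample_size:
        values = random.Random(seed).sample(values, sample_size)
    vec = [0.0] * DIM
    for value in values:
        text = f"  {str(value).strip().lower()}  "  # pad so short values still yield trigrams
        for i in range(len(text) - 2):
            vec[_bucket(text[i:i + 3])] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def rank_joinable(query_values, candidates, k=3):
    """Return the top-k (column name, similarity) pairs for a query column."""
    q = embed_column(query_values)
    scored = [(name, cosine(q, embed_column(col))) for name, col in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    query = ["New York", "Los Angeles", "Chicago", "Houston"]
    candidates = {
        "cities_upper": ["NEW YORK ", "CHICAGO", "HOUSTON", "BOSTON"],
        "order_ids": ["A-10023", "B-20931", "C-55120", "D-77812"],
    }
    for name, score in rank_joinable(query, candidates):
        print(f"{name}: {score:.3f}")
```

In this sketch the differently formatted city column ranks above the unrelated identifier column even though an exact-match join between the raw values would find no overlap. A real system would typically replace the trigram encoder with a learned embedding model and the linear scan with an approximate nearest-neighbor index.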


