Optimizing Organizations for Navigating Data Lakes

by Fatemeh Nargesian, et al.

Navigation is known to be an effective complement to search. In addition to supporting data discovery, navigation can help users develop a conceptual model of what types of data are available. In data lakes, there has been considerable research on dataset or table discovery using search. We consider the complementary problem of creating an effective navigation structure over a data lake. We define an organization as a navigation structure (a graph) whose nodes represent sets of attributes (from tables or from semi-structured documents) within a data lake and whose edges represent subset relationships. We propose a novel problem, the data lake organization problem, where the goal is to find an organization that allows a user to find attributes or tables most efficiently. We present a new probabilistic model of how users interact with an organization and define the likelihood of a user finding an attribute or a table using the organization. Our approach uses the attribute values and metadata (when available). For data lakes with little or no metadata, we propose a way of creating metadata using metadata available in other lakes. We propose an approximate algorithm for the organization problem and show its effectiveness on a synthetic benchmark. Finally, we construct an organization on tables of a real data lake containing data from federal Open Data portals and show that the organization dramatically improves the expected probability of discovering tables over a baseline. Using a second real data lake with no metadata, we show how to infer metadata that is effective in enabling organization creation.
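To make the definition of an organization concrete, the following is a minimal Python sketch of the data structure only: nodes hold sets of attributes, and an edge is accepted only when the child's attribute set is contained in the parent's. It is not the paper's probabilistic model or its approximate algorithm. The class name `Organization`, the method names, the example attribute names, and the deterministic `navigate` rule (follow the alphabetically first child that still contains the target) are all hypothetical choices made for illustration. The sketch also assumes proper subsets on edges so that the graph stays acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class Organization:
    """Illustrative navigation graph over sets of attributes (hypothetical API)."""

    # node name -> frozenset of attribute names the node represents
    nodes: dict = field(default_factory=dict)
    # parent node name -> set of child node names
    children: dict = field(default_factory=dict)

    def add_node(self, name, attributes):
        self.nodes[name] = frozenset(attributes)
        self.children.setdefault(name, set())

    def add_edge(self, parent, child):
        # An edge represents a subset relationship. Assumption: we require a
        # proper subset so that no cycles can form.
        if not self.nodes[child] < self.nodes[parent]:
            raise ValueError(f"{child!r} is not a proper subset of {parent!r}")
        self.children[parent].add(child)

    def navigate(self, root, attribute):
        """Return one path from root toward the attribute, or [] if absent."""
        if attribute not in self.nodes[root]:
            return []
        path = [root]
        current = root
        while True:
            # Children that still contain the target attribute, in a fixed order.
            matches = sorted(
                c for c in self.children[current] if attribute in self.nodes[c]
            )
            if not matches:
                return path
            current = matches[0]
            path.append(current)


# Toy example with made-up attribute names.
org = Organization()
org.add_node("all", {"city", "population", "species", "habitat"})
org.add_node("geo", {"city", "population"})
org.add_node("bio", {"species", "habitat"})
org.add_node("city_only", {"city"})
org.add_edge("all", "geo")
org.add_edge("all", "bio")
org.add_edge("geo", "city_only")

print(org.navigate("all", "city"))
```

In this toy structure a user looking for `city` descends from the broadest node to progressively narrower ones. The paper's model instead assigns probabilities to the choices a user makes at each node, which this sketch does not attempt to reproduce.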




