ML Based Lineage in Databases

by Michael Leybovich, et al.

We track the lineage of tuples throughout their database lifetime. That is, we consider a scenario in which tuples (records) produced by a query may affect other tuple insertions into the DB as part of a normal workflow. As time goes on, exact provenance explanations for such tuples become deeply nested, consuming ever more space and becoming less clear and readable. We present a novel approach for approximate lineage tracking using a Machine Learning (ML) and Natural Language Processing (NLP) technique, namely word embedding. The basic idea is to summarize (and approximate) the lineage of each tuple via a small set of constant-size vectors (the number of vectors per tuple is a hyperparameter). Our solution therefore does not suffer from space complexity blow-up over time, and it "naturally ranks" explanations for the existence of a tuple. We also devise an alternative, improved lineage tracking mechanism that keeps track of and queries lineage at the column level; this lets us better distinguish between the provenance features and the textual characteristics of a tuple. We integrate our lineage computations into the PostgreSQL system via an extension (ProvSQL). Extensive experiments exhibit useful results in terms of accuracy against exact, semiring-based justifications, especially for the column-based (CV) method, which exhibits high precision and high per-level recall. In the experiments, we focus on tuples with multiple generations of tuples in their lifelong lineage and analyze them in terms of direct and distant lineage.
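To make the idea of a bounded set of lineage vectors per tuple concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. The abstract does not say how vectors are combined or compared, so the merge rule here (repeatedly averaging the two most similar vectors until the per-tuple budget is met) and the scoring rule (maximum pairwise cosine similarity) are illustrative choices. All names (`summarize`, `derive_lineage`, `rank_candidates`, `max_vectors`) are hypothetical, and the toy 2-dimensional vectors stand in for real word embeddings.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(vectors: List[Vector], max_vectors: int) -> List[Vector]:
    """Shrink a vector set to at most max_vectors entries.

    Assumed merge rule (not from the source): average the two most
    similar vectors, and repeat until the budget is met.
    """
    if max_vectors < 1:
        raise ValueError("max_vectors must be at least 1")
    vectors = [list(v) for v in vectors]
    while len(vectors) > max_vectors:
        i, j = max(
            combinations(range(len(vectors)), 2),
            key=lambda p: cosine(vectors[p[0]], vectors[p[1]]),
        )
        merged = [(x + y) / 2.0 for x, y in zip(vectors[i], vectors[j])]
        vectors = [v for k, v in enumerate(vectors) if k not in (i, j)]
        vectors.append(merged)
    return vectors


def derive_lineage(parents: List[List[Vector]], max_vectors: int) -> List[Vector]:
    """Lineage of a new tuple: pool the parents' vectors, then summarize."""
    pooled = [v for parent in parents for v in parent]
    return summarize(pooled, max_vectors)


def rank_candidates(
    target: List[Vector], candidates: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Rank candidate tuples by best pairwise similarity to the target's lineage."""
    scores = {
        name: max(cosine(t, c) for t in target for c in vecs)
        for name, vecs in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy base tuples, each carrying a single lineage vector.
base = {"a": [[1.0, 0.0]], "b": [[0.9, 0.1]], "c": [[0.0, 1.0]]}

# A tuple derived from a and b only; its lineage stays within the budget.
derived = derive_lineage([base["a"], base["b"]], max_vectors=2)
ranking = rank_candidates(derived, base)
```

In this toy run, `ranking` places `c` last because it did not contribute to `derived`. The per-tuple storage stays bounded by `max_vectors` no matter how many generations of derivation occur, which is the space property the abstract describes; the price is that the ranking is approximate rather than an exact provenance answer.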




RETRO: Relation Retrofitting For In-Database Machine Learning on Textual Data

There are massive amounts of textual data residing in databases, valuabl...

Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning

Machine learning (ML) requires using energy to carry out computations du...

A general approach to compute the relevance of middle-level input features

This work proposes a novel general framework, in the context of eXplaina...

Abduction-Based Explanations for Machine Learning Models

The growing range of applications of Machine Learning (ML) in a multitud...

Enabling Cognitive Intelligence Queries in Relational Databases using Low-dimensional Word Embeddings

We apply distributed language embedding methods from Natural Language Pr...

Maliva: Using Machine Learning to Rewrite Visualization Queries Under Time Constraints

We consider data-visualization systems where a middleware layer translat...
