On Optimizing the Trade-off between Privacy and Utility in Data Provenance

by Daniel Deutch et al.

Organizations that collect and analyze data may wish, or be mandated by regulation, to justify and explain their analysis results. At the same time, the logic they followed to analyze the data, i.e., their queries, may be proprietary and confidential. Data provenance, a record of the transformations that data underwent, has been extensively studied as a means of explanation. In contrast, only a few works have studied the tension between disclosing provenance and hiding the underlying query. This tension is the focus of the present paper, where we formalize and explore for the first time the tradeoff between the utility of presenting provenance information and the breach of query privacy that it poses. Intuitively, our formalization is based on the notion of provenance abstraction: the representation of some tuples in the provenance expressions is abstracted in a way that makes multiple tuples indistinguishable. The privacy of a chosen abstraction is then measured by how many queries match the obfuscated provenance, in the same vein as k-anonymity. The utility is measured by the entropy of the abstraction, intuitively capturing how much information is lost about the actual tuples participating in the provenance. Our formalization yields a novel optimization problem: choosing the abstraction that best balances this tradeoff. We show that the problem is intractable in general, but design greedy heuristics that exploit the provenance structure for a practically efficient exploration of the search space. We experimentally demonstrate the effectiveness of our solution using the TPC-H benchmark and the IMDB dataset.
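To make the two measures concrete, the following is a minimal toy sketch of the abstract's intuition, not the paper's exact definitions. The function names `utility_loss` and `privacy_k` are hypothetical; it assumes an abstraction is a mapping from provenance tuples to labels (tuples sharing a label become indistinguishable), that the entropy-style utility loss of a group of n indistinguishable tuples is n·log2(n) bits, and that privacy is counted as the number of candidate queries consistent with the obfuscated provenance.

```python
import math

def utility_loss(abstraction):
    """Toy entropy-style loss (illustrative, not the paper's formula):
    a group of n tuples mapped to the same abstract label contributes
    n * log2(n) bits of uncertainty about which tuple participated
    (0 when the group is a singleton, i.e., nothing is hidden)."""
    groups = {}
    for tup, label in abstraction.items():
        groups.setdefault(label, []).append(tup)
    return sum(len(g) * math.log2(len(g)) for g in groups.values())

def privacy_k(candidate_queries, obfuscated_prov, matches):
    """k-anonymity-style privacy: the number of candidate queries
    that are consistent with the obfuscated provenance, where
    `matches(q, prov)` is a caller-supplied consistency check."""
    return sum(1 for q in candidate_queries if matches(q, obfuscated_prov))

# Example: abstracting t1 and t2 to the same label "A" hides which of
# the two participated (2 * log2(2) = 2 bits lost); t3 stays exact.
abstraction = {"t1": "A", "t2": "A", "t3": "B"}
loss = utility_loss(abstraction)  # 2.0 bits
```

The optimization problem the abstract describes is then the search over all such mappings for one that maximizes the privacy count while keeping the entropy loss low, a space the paper explores with greedy, structure-aware heuristics.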




