Privacy-Preserving Linkage of Distributed Datasets using the Personal Health Train

by Maximilian Jugl, et al.

Because personal and medical data are generated at many locations, medical data science faces unique challenges when working with distributed datasets. Growing data protection requirements in recent years drastically limit the use of personally identifiable information. Distributed data analysis aims to provide solutions for securely working on highly sensitive data while minimizing the risk of information leaks, which would not be possible to the same degree in a centralized approach. A novel concept in this field is the Personal Health Train (PHT), which encapsulates the idea of bringing the analysis to the data, not vice versa. Data sources are represented as train stations. Trains containing analysis tasks move between stations and aggregate results. Train executions are coordinated by a central station with which data analysts can interact. Data remain at their respective stations, and analysis results are stored only inside the train, providing a safe and secure environment for distributed data analysis. Duplicate records across multiple locations can skew the results of a distributed data analysis. On the other hand, merging information from several datasets that refer to the same real-world entities may improve data completeness and therefore data quality. In this paper, we present an approach for record linkage on distributed datasets using the Personal Health Train. We verify this approach and evaluate its effectiveness by applying it to two datasets based on real-world data, and we outline its possible applications in the context of distributed data analysis tasks.
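
The train-and-station idea described in the abstract, and the way duplicate records can skew a distributed result, can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any PHT software: the `Station` and `Train` classes, the `count_patients` task, and the toy records are all hypothetical names and data introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Train:
    """Hypothetical train: carries an analysis task and the aggregated results."""
    task: Callable[[List[dict]], int]
    results: List[int] = field(default_factory=list)


@dataclass
class Station:
    """Hypothetical data source: its records are only read locally."""
    name: str
    _records: List[dict]

    def execute(self, train: Train) -> None:
        # The task runs at the station; only its aggregate is stored in the train.
        train.results.append(train.task(self._records))


def count_patients(records: List[dict]) -> int:
    """Toy analysis task: number of records held by a station."""
    return len(records)


# Toy data (assumption): patient "p2" is registered at both stations.
stations = [
    Station("A", [{"id": "p1"}, {"id": "p2"}]),
    Station("B", [{"id": "p2"}, {"id": "p3"}]),
]

train = Train(task=count_patients)
for station in stations:
    station.execute(train)

# Summing per-station counts treats the duplicate as two patients.
naive_total = sum(train.results)
print(train.results, naive_total)
```

In this toy setup the summed count is larger than the number of distinct patients, because the shared record is counted once per station. Correcting such a result without moving the records out of their stations is the kind of problem that record linkage on distributed datasets addresses.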




