Synthesize Extremely High-dimensional Longitudinal Electronic Health Records via Hierarchical Autoregressive Language Model

by Brandon Theodorou, et al.

Synthetic electronic health records (EHRs) that are both realistic and privacy-preserving can serve as an alternative to real EHRs for machine learning (ML) modeling and statistical analysis. However, generating high-fidelity, granular EHR data in its original, high-dimensional form is challenging for existing methods because of the complexities inherent in high-dimensional data. In this paper, we propose Hierarchical Autoregressive Language mOdel (HALO) for generating longitudinal, high-dimensional EHR data that preserves the statistical properties of real EHRs and can be used to train accurate ML models without privacy concerns. Our HALO method, designed as a hierarchical autoregressive model, generates a probability density function over medical codes, clinical visits, and patient records, allowing for the generation of realistic EHR data in its original, unaggregated form without the need for variable selection or aggregation. Additionally, our model produces high-quality continuous variables in a longitudinal and probabilistic manner. We conduct extensive experiments and demonstrate that HALO can generate high-fidelity EHR data with high-dimensional disease code probabilities (d > 10,000), disease co-occurrence probabilities within visits (d > 1,000,000), and conditional probabilities across consecutive visits (d > 5,000,000), achieving an R² correlation above 0.9 in comparison to real EHR data. This performance enables downstream ML models trained on HALO's synthetic data to achieve accuracy comparable to models trained on real data (0.938 AUROC with HALO data vs. 0.943 with real data). Finally, using a combination of real and synthetic data enhances the accuracy of ML models beyond that achieved by using only real EHR data.
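To make the fidelity comparison concrete, the following is a minimal sketch of how per-code probabilities in real and synthetic records might be compared with an R² score. It is an illustration under stated assumptions, not the authors' evaluation code: the record layout (patients as lists of visits, visits as sets of integer codes), the function names `code_prevalence` and `r_squared`, the toy data, and the choice of the coefficient of determination as the "R² correlation" are all assumptions made here.

```python
import numpy as np


def code_prevalence(records, num_codes):
    """Fraction of visits in which each code appears.

    Assumed layout: `records` is a list of patients, each patient is a
    list of visits, and each visit is a set of integer code indices.
    """
    counts = np.zeros(num_codes)
    num_visits = 0
    for patient in records:
        for visit in patient:
            num_visits += 1
            for code in set(visit):
                counts[code] += 1
    return counts / max(num_visits, 1)


def r_squared(real, synthetic):
    """Coefficient of determination, treating the real statistics as the target."""
    ss_res = np.sum((real - synthetic) ** 2)
    ss_tot = np.sum((real - real.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy data (hypothetical): 2 patients, 2 visits each, 4 possible codes.
real_records = [[{0, 1}, {1, 2}], [{0}, {0, 3}]]
synthetic_records = [[{0, 1}, {1}], [{0}, {0, 3}]]

real_p = code_prevalence(real_records, num_codes=4)
synth_p = code_prevalence(synthetic_records, num_codes=4)
score = r_squared(real_p, synth_p)
print(f"R^2 between real and synthetic code prevalence: {score:.3f}")
```

The same pattern would extend to the within-visit co-occurrence and consecutive-visit conditional probabilities mentioned in the abstract by replacing `code_prevalence` with a function that counts code pairs, though at the stated dimensionalities a sparse representation would likely be needed.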




