Optimal Multi-Wave Validation of Secondary Use Data with Outcome and Exposure Misclassification

by Sarah C. Lotspeich, et al.

The growing availability of observational databases like electronic health records (EHR) provides unprecedented opportunities for secondary use of such data in biomedical research. However, these data can be error-prone and need to be validated before use. It is usually unrealistic to validate the whole database due to resource constraints. A cost-effective alternative is to implement a two-phase design that validates a subset of patient records that are enriched for information about the research question of interest. Herein, we consider odds ratio estimation under differential outcome and exposure misclassification. We propose optimal designs that minimize the variance of the maximum likelihood odds ratio estimator. We develop a novel adaptive grid search algorithm that can locate the optimal design in a computationally feasible and numerically accurate manner. Because the optimal design requires specification of unknown parameters at the outset and thus is unattainable without prior information, we introduce a multi-wave sampling strategy to approximate it in practice. We demonstrate the efficiency gains of the proposed designs over existing ones through extensive simulations and two large observational studies. We provide an R package and Shiny app to facilitate the use of the optimal designs.
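As a rough illustration of the grid search idea mentioned above, the following Python sketch searches over validation allocations across sampling strata under a fixed budget, first on a coarse grid and then on finer grids around the current best candidate. It is a minimal sketch under stated assumptions, not the authors' algorithm: the objective `stand_in_variance` is a simple Neyman-type placeholder rather than the variance of the maximum likelihood odds ratio estimator, and every name and number here (`stratum_sizes`, `stratum_sds`, `budget`, the step sizes) is hypothetical.

```python
from itertools import product


def stand_in_variance(alloc, weights):
    # Placeholder objective (an assumption): sum_h w_h / n_h.
    # The paper instead minimizes the variance of the MLE of the odds ratio.
    return sum(w / n for w, n in zip(weights, alloc))


def candidates(lower, upper, step, total):
    # Put all strata but the last on a grid with the given step;
    # the last stratum receives whatever is left of the budget.
    axes = [range(lo, hi + 1, step) for lo, hi in zip(lower[:-1], upper[:-1])]
    for head in product(*axes):
        last = total - sum(head)
        if lower[-1] <= last <= upper[-1]:
            yield head + (last,)


def adaptive_grid_search(sizes, weights, total, min_per_stratum, steps=(20, 5, 1)):
    # Step sizes should be nested (each divides the previous one) so that the
    # current best candidate stays on the next, finer grid.
    lower = [min_per_stratum] * len(sizes)
    upper = list(sizes)
    best = None
    for step in steps:
        best = min(
            candidates(lower, upper, step, total),
            key=lambda a: stand_in_variance(a, weights),
        )
        # Shrink the search window to +/- one current step around the best point.
        lower = [max(min_per_stratum, b - step) for b in best]
        upper = [min(n_h, b + step) for b, n_h in zip(best, sizes)]
    return best


# Hypothetical inputs: four strata, e.g. defined by the error-prone
# outcome and exposure, with made-up sizes and variability.
stratum_sizes = [400, 300, 200, 100]
stratum_sds = [0.1, 0.3, 0.4, 0.5]
weights = [(n_h * s_h) ** 2 for n_h, s_h in zip(stratum_sizes, stratum_sds)]
budget = 260

best = adaptive_grid_search(stratum_sizes, weights, budget, min_per_stratum=10)
print("allocation:", best)
print("objective:", stand_in_variance(best, weights))
```

Refining only around the current best keeps the number of evaluated designs small compared with an exhaustive search at step 1, at the cost of possibly missing the global optimum when the objective is not well behaved. In a multi-wave setting, the weights would be re-estimated from the records validated so far before each new wave is allocated.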


