Estimating and accounting for unobserved covariates in high dimensional correlated data
Many high-dimensional and high-throughput biological datasets have complex sample correlation structures; examples include longitudinal and multi-tissue data, as well as data with multiple treatment conditions or related individuals. These data, like nearly all high-throughput 'omic' data, are influenced by technical and biological factors unknown to the researcher, which, if unaccounted for, can severely obscure estimation of and inference on the effects of the known covariate of interest. We therefore developed CBCV and CorrConf: provably accurate and computationally efficient methods to choose the number of, and estimate, the latent confounding factors present in high-dimensional data with correlated or nonexchangeable residuals. We demonstrate each method's superior performance compared to other state-of-the-art methods by analyzing simulated multi-tissue gene expression data and by identifying sex-associated DNA methylation sites in a real, longitudinal twin study. As far as we are aware, these are the first methods to estimate the number of, and correct for, latent confounding factors in data with correlated or nonexchangeable residuals. An R package is available for download at https://github.com/chrismckennan/CorrConf.
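To make the confounding problem concrete, the following is a minimal toy simulation, not the CBCV or CorrConf procedure. It assumes a single response, independent residuals, and a confounder that is correlated with the covariate of interest; all names and parameter values (`simulate`, `naive_effect`, `adjusted_effect`, the 0.8 and 0.6 coefficients) are hypothetical choices for illustration. The "adjusted" fit is an oracle that observes the confounder directly, which is exactly what is unavailable in practice and what latent factor methods aim to approximate.

```python
import random


def simulate(n=5000, beta=0.0, gamma=1.0, seed=0):
    """Draw a covariate x, a confounder c correlated with x, and a response y.

    beta is the true effect of x on y (zero here); gamma is the effect of c.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    c = [0.8 * xi + rng.gauss(0, 0.6) for xi in x]  # confounder tracks x
    y = [beta * xi + gamma * ci + rng.gauss(0, 1) for xi, ci in zip(x, c)]
    return x, c, y


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def naive_effect(x, y):
    """Least-squares slope of y on x alone (no intercept; variables have mean zero)."""
    return dot(x, y) / dot(x, x)


def adjusted_effect(x, c, y):
    """Coefficient on x from least squares of y on both x and c (2x2 normal equations)."""
    sxx, sxc, scc = dot(x, x), dot(x, c), dot(c, c)
    sxy, scy = dot(x, y), dot(c, y)
    det = sxx * scc - sxc * sxc
    return (scc * sxy - sxc * scy) / det


x, c, y = simulate()
naive_est = naive_effect(x, y)
adjusted_est = adjusted_effect(x, c, y)
print(f"true effect: 0.0, naive: {naive_est:.3f}, oracle-adjusted: {adjusted_est:.3f}")
```

In this setup the naive slope absorbs the confounder's contribution (roughly gamma times 0.8), so it reports a sizeable effect where none exists, while the oracle-adjusted estimate stays near zero. The sketch deliberately ignores the correlated or nonexchangeable residuals that the abstract identifies as the central difficulty.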