Topical Hidden Genome: Discovering Latent Cancer Mutational Topics using a Bayesian Multilevel Context-learning Approach

12/30/2022
by Saptarshi Chakraborty, et al.

Statistical inference on the cancer-site specificities of collective ultra-rare whole-genome somatic mutations is an open problem. Traditional statistical methods cannot handle whole-genome mutation data because of their ultra-high dimensionality and extreme sparsity: for example, >30 million unique variants are observed in the dataset of ~1700 whole-genome tumors considered herein, of which >99% are encountered only once. To harness the information in these rare variants, we have recently proposed the "hidden genome model", a formal multilevel multi-logistic model that mines information in ultra-rare somatic variants to characterize tumor types. The model condenses signals in rare variants through a hierarchical layer that leverages the contexts of individual mutations. It is currently implemented using consistent, scalable point estimation techniques that can handle tens of millions of variants detected across thousands of tumors. Our recent publications have demonstrated its accuracy and attributability at scale. However, principled statistical inference from the model is infeasible because of the volume, correlation, and non-interpretability of the mutation contexts. In this paper we propose a novel framework that leverages topic models from the field of computational linguistics to induce an *interpretable dimension reduction* of the mutation contexts used in the model. The proposed model is implemented using an efficient MCMC algorithm that permits rigorous full Bayesian inference at a scale that is orders of magnitude beyond the capability of out-of-the-box high-dimensional multi-class regression methods and software. We apply our model to the Pan Cancer Analysis of Whole Genomes (PCAWG) dataset, and our results reveal novel insights.
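To make the structure described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It simulates the generative side of a generic topic model over mutation contexts, then passes the resulting low-dimensional topic exposures through a multi-logistic (softmax) link to obtain tumor-type probabilities. All sizes, hyperparameters, and variable names (`topic_context`, `tumor_topic`, `coef`) are assumptions chosen for illustration. The count of 96 context categories is arbitrary, and no MCMC inference is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, not taken from the paper).
n_tumors, n_contexts, n_topics, n_sites = 6, 96, 3, 4

# Each topic is a probability distribution over mutation contexts.
topic_context = rng.dirichlet(np.full(n_contexts, 0.1), size=n_topics)

# Each tumor mixes the topics with its own exposure weights.
tumor_topic = rng.dirichlet(np.full(n_topics, 0.5), size=n_tumors)

# Per-tumor context distribution is the exposure-weighted mix of topics.
context_probs = tumor_topic @ topic_context
context_probs /= context_probs.sum(axis=1, keepdims=True)  # guard against rounding

# Observed context counts: one multinomial draw per tumor.
n_mutations = rng.integers(500, 2000, size=n_tumors)
context_counts = np.vstack(
    [rng.multinomial(n, p) for n, p in zip(n_mutations, context_probs)]
)

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Multi-logistic link: topic exposures (3 columns) replace the raw
# context counts (96 columns) as predictors of cancer site.
coef = rng.normal(size=(n_topics, n_sites))  # hypothetical coefficients
site_probs = softmax(tumor_topic @ coef)

print(context_counts.shape, site_probs.shape)
```

The point of the sketch is the change in predictor dimension: the regression layer sees `n_topics` interpretable columns instead of `n_contexts` correlated ones. In the paper's setting the exposures and coefficients would be inferred jointly from data rather than simulated.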


