Modelling phylogeny in 16S rRNA gene sequencing datasets using string kernels
Motivation: Bacterial community composition is commonly quantified using 16S rRNA (ribosomal ribonucleic acid) gene sequencing. A defining characteristic of these datasets is the phylogenetic relationships that exist between variables. Here, we demonstrate the utility of modelling these phylogenetic relationships in two tasks (the two-sample test and host trait prediction) using a novel application of string kernels.

Results: We show via simulation studies that a kernel two-sample test using string kernels is sensitive to the phylogenetic scale of the difference between the two populations and is more powerful than tests using kernels based on popular microbial distance metrics. We also demonstrate, using simulations and two real host trait prediction tasks, how Gaussian process modelling can be used to infer the distribution of bacterial-host effects across the phylogenetic tree.
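To make the idea of a kernel two-sample test with a string kernel concrete, the following is a minimal sketch and not the authors' implementation. It assumes (i) a spectrum (k-mer counting) kernel as one simple example of a string kernel, computed between the representative sequences of each taxon; (ii) a sample-level kernel of the form X S Xᵀ, where X holds relative abundances and S is the taxon-by-taxon string kernel matrix; and (iii) a biased maximum mean discrepancy (MMD) statistic calibrated by permutation. All function names, sequences and abundance values are hypothetical and chosen only for illustration.

```python
from collections import Counter

import numpy as np


def spectrum_features(seq, k):
    """Count every length-k substring (k-mer) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seqs, k=3):
    """Gram matrix of the spectrum kernel: inner products of k-mer counts."""
    feats = [spectrum_features(s, k) for s in seqs]
    n = len(seqs)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            # Counter returns 0 for k-mers absent from the other sequence.
            val = sum(c * feats[j][kmer] for kmer, c in feats[i].items())
            S[i, j] = S[j, i] = float(val)
    return S


def mmd2_biased(K, n):
    """Biased squared MMD; the first n rows/columns of K are sample 1."""
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()


def permutation_test(X, Y, S, n_perm=500, seed=0):
    """Two-sample test with sample kernel K = Z S Z^T (an assumed form)."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    K = Z @ S @ Z.T
    n = len(X)
    observed = mmd2_biased(K, n)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        # Relabel samples by permuting rows and columns of K together.
        if mmd2_biased(K[np.ix_(idx, idx)], n) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical representative sequences for three taxa.
    seqs = ["ACGTACGT", "ACGTACGA", "TTGGCCAA"]
    S = spectrum_kernel(seqs, k=3)
    rng = np.random.default_rng(1)
    # Hypothetical relative abundances for two host populations.
    X = rng.dirichlet([8.0, 1.0, 1.0], size=10)
    Y = rng.dirichlet([1.0, 1.0, 8.0], size=10)
    stat, p = permutation_test(X, Y, S)
    print(f"MMD^2 = {stat:.4f}, permutation p-value = {p:.4f}")
```

In this sketch the k-mer length `k` loosely controls how much sequence similarity, and hence phylogenetic relatedness, the kernel treats as shared signal between taxa. This is one plausible way to picture the sensitivity to phylogenetic scale described above, under the stated assumptions.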