Bayesian Approaches for Flexible and Informative Clustering of Microbiome Data

by Yushu Shi, et al.

We propose two unsupervised clustering methods designed for human microbiome data. Existing clustering approaches do not fully address the challenges of microbiome data, which are typically structured as counts with a fixed-sum constraint. In addition to accounting for this structure, we recognize that high-dimensional microbiome datasets often contain uninformative features, or "noise" operational taxonomic units (OTUs), that hinder successful clustering. To address this challenge, we select the features that are useful in differentiating groups during the clustering process. By taking a Bayesian modeling approach, we are able to learn the number of clusters from the data rather than fixing it in advance. We first describe a basic version of the model that uses Dirichlet multinomial distributions as mixture components and does not require any additional information on the OTUs. When phylogenetic or taxonomic information is available, however, we rely on Dirichlet tree multinomial distributions, which capture the tree-based topological structure of microbiome data. We test the performance of our methods through simulation, and illustrate their application first to gut microbiome data of children from different regions of the world, and then to a clinical study exploring differences in the microbiome between long-term and short-term pancreatic cancer survivors. Our results demonstrate that the proposed methods have performance advantages over commonly used unsupervised clustering algorithms, with the additional scientific benefit of identifying informative features.
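
To make the basic mixture component concrete, the following is a minimal sketch of the standard Dirichlet multinomial log-probability for a single sample of OTU counts. It is not taken from the paper: the function name `dirichlet_multinomial_logpmf` and the example parameter values are assumptions chosen for illustration, and the feature selection and the inference of the number of clusters described above are not shown.

```python
import math


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of one count vector under a Dirichlet multinomial.

    counts: non-negative integer OTU counts for one sample.
    alpha:  positive concentration parameters, one per OTU (illustrative values).
    """
    n = sum(counts)          # total count of the sample (the fixed sum)
    alpha_sum = sum(alpha)   # total concentration
    # Normalising part: n! * Gamma(alpha_sum) / Gamma(n + alpha_sum)
    log_p = math.lgamma(n + 1) + math.lgamma(alpha_sum) - math.lgamma(n + alpha_sum)
    # Per-OTU part: Gamma(x + a) / (x! * Gamma(a))
    for x, a in zip(counts, alpha):
        log_p += math.lgamma(x + a) - math.lgamma(x + 1) - math.lgamma(a)
    return log_p


# With two OTUs and alpha = (1, 1), every split of n counts is equally likely,
# so a sample with 5 total counts has probability 1 / 6.
example_probability = math.exp(dirichlet_multinomial_logpmf([3, 2], [1.0, 1.0]))
```

In a mixture model, each cluster would carry its own `alpha` vector, and a sample's cluster membership would be weighed by this log-probability together with the cluster's prior weight.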




