The Tajima heterochronous n-coalescent: inference from heterochronously sampled molecular data

04/14/2020
by Lorenzo Cappello, et al.

The observed sequence variation at a locus is informative about the evolutionary history of the sample and about past population size dynamics. The standard Kingman coalescent model on genealogies (timed trees that represent the ancestry of the sample) is used in a generative model of molecular sequence variation to infer evolutionary parameters. However, the state space of Kingman's genealogies grows superexponentially with the sample size n, making inference computationally infeasible even for small n. We introduce a new coalescent model, the Tajima heterochronous n-coalescent, whose genealogical space has a substantially smaller cardinality. This process makes it possible to analyze samples collected at different times, a situation that in applications is both common (e.g. ancient DNA, and RNA from rapidly evolving pathogens such as viruses) and statistically desirable (variance reduction and parameter identifiability). We propose an algorithm to calculate the likelihood efficiently and present a Bayesian nonparametric procedure to infer the population size trajectory. We provide a new MCMC sampler to explore the space of Tajima's genealogies and model parameters. We compare our procedure with state-of-the-art methodologies in simulations and applications. We use our method to re-examine the scientific question of how Beringian bison went extinct by analyzing modern and ancient molecular sequences of bison in North America, and to reconstruct the population size trajectory of SARS-CoV-2 from viral sequences collected in France and Germany.
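
To give a sense of the superexponential growth mentioned above, the following minimal sketch counts the ranked, labeled binary tree topologies underlying Kingman's coalescent for n samples collected at the same time, using the standard combinatorial formula n!(n-1)!/2^(n-1). The function name `kingman_topology_count` is hypothetical and not taken from the paper, and the sketch does not cover heterochronous sampling or the Tajima counts.

```python
from math import factorial


def kingman_topology_count(n: int) -> int:
    """Number of ranked, labeled binary tree topologies on n leaves.

    Assumes all n samples are collected at the same time (isochronous).
    Going backwards in time, there are C(k, 2) ways to pick the pair that
    coalesces when k lineages remain; multiplying over k = 2..n gives
    n! * (n - 1)! / 2**(n - 1).
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    # The division is exact because the count is a product of binomials.
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 10, 20):
        print(n, kingman_topology_count(n))
```

With only 10 samples the count is already above two billion, which illustrates why a coarser genealogical space is attractive for inference.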
