Convergence-Divergence MCMC

Stephen Pollard

November 28, 2019

2015-2017

Convergence-Divergence MCMC is a Markov chain Monte Carlo method for fitting a custom model of the expected distribution of convergence, divergence, and distance between pairs of branches on a phylogenetic tree. A convergence between two branches occurs when substitutions at the same site on both branches result in the same amino acid. For example, a substitution on branch #403 at site 14 from Alanine to Glutamine and a substitution on branch #586 at site 14 from Methionine to Glutamine would be a single convergence event between branches #403 and #586. If either substitution were instead to an amino acid other than Glutamine, so that the two branches ended in different amino acids at that site, the event would count as a divergence. The distance between two branches is measured along the tree from the top (ancestral end) of one branch, back in time to the most recent common ancestor, and down to the top (ancestral end) of the other branch.
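The definitions above can be illustrated with a minimal Python sketch. It is not taken from the repository: the Substitution record, the classify_pair and branch_distance functions, and the parent/length dictionaries are hypothetical names chosen for illustration. It also assumes that each branch is identified by its child node, that the "top" of a branch is that node's parent, and that distance is the sum of branch lengths along the path.

```python
from collections import namedtuple

# Hypothetical record for one inferred substitution on one branch.
Substitution = namedtuple("Substitution", ["branch", "site", "ancestral", "derived"])


def classify_pair(a, b):
    """Return 'convergence', 'divergence', or None if the two do not form a pair."""
    # Only substitutions at the same site on two different branches are compared.
    if a.site != b.site or a.branch == b.branch:
        return None
    return "convergence" if a.derived == b.derived else "divergence"


def branch_distance(a, b, parent, length):
    """Path length between the tops (parent nodes) of branches a and b.

    parent maps each non-root node to its parent (assumed layout);
    length maps each non-root node to the length of the branch above it.
    """
    # Walk from the top of branch a to the root, recording the distance
    # from that top to each of its ancestors.
    dist_from_a = {}
    node, d = parent[a], 0.0
    while True:
        dist_from_a[node] = d
        if node not in parent:  # reached the root
            break
        d += length[node]
        node = parent[node]
    # Walk up from the top of branch b until the two paths meet;
    # the meeting point is the most recent common ancestor.
    node, d = parent[b], 0.0
    while node not in dist_from_a:
        d += length[node]
        node = parent[node]
    return d + dist_from_a[node]


# The example from the text, using one-letter amino acid codes
# (A = Alanine, M = Methionine, Q = Glutamine).
s1 = Substitution(branch=403, site=14, ancestral="A", derived="Q")
s2 = Substitution(branch=586, site=14, ancestral="M", derived="Q")
example_event = classify_pair(s1, s2)

# A toy tree: R is the root, X and Y are internal nodes, a, b, c are tips.
toy_parent = {"X": "R", "Y": "R", "a": "X", "b": "X", "c": "Y"}
toy_length = {"X": 1.0, "Y": 2.0, "a": 0.5, "b": 0.5, "c": 0.3}
```

In this sketch, sibling branches such as a and b share the same top and therefore have a distance of zero, while branches a and c are separated by the lengths of the X and Y branches.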

After calculating the convergence, divergence, and distance for all the branch pairs in the mitochondrial phylogenetic tree using all the mitochondrially encoded proteins, the resulting dataset had over 700,000 points, each with three features: convergence (C), divergence (D), and distance. I divided the points into bins (windows) by distance, then fit a custom statistical model to the convergence and divergence counts. I modelled each data point as the outcome of a series of Bernoulli trials (coin flips) with C+D trials and C successes. The probability that any double substitution is convergent can be derived from the overall C/D ratio for the window: a ratio r corresponds to a probability of r/(1+r). The likelihood of a data point with C convergences out of C+D total double substitutions in window w can therefore be calculated using the binomial distribution.
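As a rough illustration of the binomial likelihood described above, the following Python sketch computes the log-likelihood of one data point and of a whole window. The function names and the choice to work in log space are assumptions made for this example, not details taken from the repository.

```python
import math


def p_from_ratio(r):
    """Convert a C/D ratio r into the probability that a double substitution is convergent."""
    return r / (1.0 + r)


def binomial_log_likelihood(c, d, p):
    """Log of the binomial probability of c convergences in c + d trials, for 0 < p < 1."""
    n = c + d
    # log of the binomial coefficient n choose c, via log-gamma for numerical stability
    log_choose = math.lgamma(n + 1) - math.lgamma(c + 1) - math.lgamma(d + 1)
    return log_choose + c * math.log(p) + d * math.log(1.0 - p)


def window_log_likelihood(points, r):
    """Sum the log-likelihoods of all (C, D) points in one distance window."""
    p = p_from_ratio(r)
    return sum(binomial_log_likelihood(c, d, p) for c, d in points)
```

For example, a window with a C/D ratio of 1 gives p = 0.5, and window_log_likelihood([(1, 1), (2, 0)], 1.0) sums the log-probabilities of seeing one convergence in two trials and two convergences in two trials.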

Repository: https://publichg.stephentpollard.com/CDMCMC