Note to reader: If you are using these materials and images for your class or find any mistakes please contact Rosana Zenil-Ferguson at roszenil(at)uky(dot)edu. Attribution is appreciated.

Modeling macroevolution of a discrete trait

In this workshop we will focus on modeling and inference of discrete trait macroevolution. We will do this using a very famous type of stochastic model called the Markov chain. Markov chains are popular tools for modelers because they describe how a random process changes over time. In the macroevolutionary literature, some of the most popular models, such as Brownian motion (BM) and Ornstein-Uhlenbeck (OU), are Markov chains that follow a continuous trait over continuous time. These two very popular models have essential mathematical properties that allow us to estimate evolutionary rates (change over time) using the phylogenetic distances.

For discrete states, the story is different. We also use Markov chains, which we have branded as Mkn models, but we cannot easily include phylogenetic distances in the same manner a BM or OU model does. As we will discuss in the following sections, for every branch we have to calculate a probability matrix, which makes these models computationally difficult.

For all the examples we will discuss in this workshop we assume a phylogenetic tree whose tips are discrete states of a trait (Fig. 1). Before jumping into the computational sections, I will start with some formal notation and an enumeration of the nodes. The nodes are numbered following the Newick format in R: tips first, followed by the root, and then internal nodes in a time-forward order.


Figure 1. Simple phylogenetic tree with a discrete trait. Numbers in blue represent the node numbers as done in the Newick format in R

What is a continuous-time Markov chain (CTMC)?

A Markov process \(X_t\) is a stochastic model with the property that the probability of any particular future behavior of the process, when its current state is known exactly, is not altered by additional knowledge concerning its past behavior. In other words, “the future only depends on the present and not the past”. The time \(t\) can be continuous or discrete, and the values that the variable we are following through time, \(X(t)\), can take can be continuous (e.g., body mass) or discrete (e.g., numbers of chromosomes).

Three important assumptions we will agree on for this workshop:

  1. In phylogenetic comparative methods we use \(\{X(t), t\geq 0\}\) to represent the evolution of a trait over time \(t\) measured in millions of years.

  2. The traits that we will model here are discrete, meaning \(X(t)\) takes values only in the non-negative integers \(\mathbb{Z}^+=\{0,1,2,...\}\). We will use the notation \(X_i^n(t)\) to represent what is happening throughout the phylogenetic tree. \(X\) is the trait, \(t\) is time, \(i\) is the value of the trait, and \(n\) is the node number.


Figure 2. Notation for a continuous-time Markov chain over a phylogenetic tree when tracking a trait over nodes


  3. The Markovian property “the future does not depend on the past” is mathematically represented with a conditional probability \[{P}(X_i(t)|X_j(u), X_k(v))={P}(X_i(t)|X_j(u)) \textrm{ with } v < u < t\]

Figure 3. The Markovian property states that the red and the blue path have the exact same probability

Extra note on the Markovian property

What the Markovian property states for phylogenetic trees is that, given a node, it is sufficient to focus on its most recent ancestor to calculate the probability of a trait change.

The conditional probabilities of Markov chains

As the Markovian property states above, we are interested in the conditional probability \(P(X_j^n(t)|X_i^m(u))\) of going from state \(i\) to state \(j\) in \(t-u\) units of time. This probability is often represented by the notation \(p_{ij}(t-u)\).

For example, in Fig. 3 we are tracking the evolution of a trait \(X\) that has states \(0,1\) (imagine, for example, the presence or absence of a tail in reptiles). We are interested in calculating the probability that the tail evolved from internal node 5 to tip node 2 in \(10-3=7\) million years, given that tails were not present in ancestor node 5, that is: \[P(X^2(10)=1|X^5(3)=0)=p_{01}(10-3)=p_{01}(7)\]

The equation I just wrote is called the transition probability of evolving from 0 to 1 in 7 units of time. In general, we can then write all the transition probabilities in a matrix

\[P(t)= \left( {\begin{array}{cc} p_{00}(t) & p_{01}(t) \\ p_{10}(t) & p_{11}(t) \\ \end{array} } \right)\]

This matrix is organized such that the first subindex of \(p\) is the row and represents the condition or starting value of the trait (a.k.a. “leaving from”), and the second subindex is the final value of the trait (a.k.a. “arriving to”). The condition in the probability makes an important assumption: “given that the most recent ancestor didn’t have a tail”. In addition, the rows add up to 1; for instance, \(p_{00}(t)+p_{01}(t)=1\) because, given that the trait was at state 0, it either stayed at 0 or evolved to 1, and those two probabilities together account for every possibility.
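To make this concrete, here is a small sketch computing \(p_{01}(7)\) for the tail example. The rates \(q_{01}=0.5\) and \(q_{10}=0.03\) are illustrative values (the same ones used later in this workshop), and the expm package (also introduced later) supplies the matrix exponential:

```r
library(expm)  # matrix exponential; also used later in this workshop

# Q-matrix for the two-state tail model (assumed illustrative rates)
Q <- matrix(c(-0.5, 0.03, 0.5, -0.03), nrow = 2)

# Transition probabilities over 7 million years
P7 <- expm(Q * 7)
P7[1, 2]     # p_01(7): probability the tail evolved, approx 0.92
rowSums(P7)  # each row sums to 1
```

Note how, over a long branch, \(p_{01}(7)\) is close to 1 because the gain rate is much larger than the loss rate.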

Calculating the transition probabilities

Transition probabilities would be wonderful if we could calculate them easily. They represent the probability of evolutionary change while accounting for uncertainty. To our bad luck, calculating these probabilities is very hard for Mkn models, especially those with large numbers of discrete states.

We are going to go on a tour using a new mathematical tool to define Mkn models (or any Markov model with discrete traits), the transition RATES and the Q-matrix, and then go back to show how transition rates are used to define transition probabilities.

Transition rates

Since we cannot calculate the probabilities directly, we propose using transition rates. As their name states, transition rates represent the amount of change over time. Just like the speed at which you’re driving (miles per hour or km per hour), transition rates tell us the number of changes we expect from one state value to the next in one million years of evolution (note: the time units depend on the dating of your tree; if your phylogenetic tree is time-calibrated the units are millions of years, but if you are using a tree that is the result of a molecular clock estimation the units are in expected molecular change).

The rates that we will use are instantaneous transition rates, meaning that the changes from one state to the next occur in an interval of time that is infinitesimally small. In mathematics, the word infinitesimal means we are defining a derivative (or a limit when time goes to 0), so the formal relationship between a transition probability and a transition rate is

\[q_{ij}=\lim_{h\to 0} \frac{P(X(t+h)=j|X(t)=i)}{h}=\lim_{h \to 0} \frac{p_{ij}(h)}{h}, \quad i \neq j\]
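This limit can be sketched numerically: for a small time step \(h\), the off-diagonal entries of \(P(h)/h\) approach the transition rates. The rates below are illustrative, and the expm package (introduced later in this workshop) is assumed:

```r
library(expm)  # matrix exponential

q_01 <- 0.5; q_10 <- 0.03
Q <- matrix(c(-q_01, q_10, q_01, -q_10), nrow = 2)

h <- 1e-6           # a very small time step
P_h <- expm(Q * h)  # transition probabilities over time h
P_h[1, 2] / h       # approaches q_01 = 0.5
P_h[2, 1] / h       # approaches q_10 = 0.03
```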

Now, let’s go back to our simplified example with two states, the presence and absence of tails. To define a Mkn model then we would only need two transition rates: \(q_{01}\) and \(q_{10}\). Frequently, Mkn models are represented by circle and arrow diagrams like in Fig. 4.
Figure 4. Circle and arrow diagram for a Mkn model with two states and two transition rates

A second, more intuitive way for biologists to understand transition rates is their relationship with waiting times. For Markov chains, the waiting time to transition from one state to another has an exponential probability distribution with mean equal to the inverse of the rate. Look at the exponential distribution example below. When the transition rate is \(q_{01}=0.5\), the waiting time to evolve from 0 to 1 is on average \(\frac{1}{0.5}=2\) million years, and when the transition rate is \(q_{10}=0.03\), the waiting time to transition from 1 to 0 is on average \(\frac{1}{0.03}\approx 33.3\) million years.

library(ggplot2)
library(reshape2)

# Transition rates
q_01 <- 0.5
q_10 <- 0.03

# Simulate 1000 exponential waiting times for each transition
waiting_times <- data.frame(wait_01 = rexp(1000, rate = q_01),
                            wait_10 = rexp(1000, rate = q_10))

# Long format for plotting
data <- melt(waiting_times)

ggplot(data, aes(x = value, fill = variable)) +
  geom_density(alpha = 0.5) +
  coord_cartesian(xlim = c(0, 50)) +
  geom_vline(xintercept = 1/q_01, color = "red") +   # mean waiting time 0 -> 1
  geom_vline(xintercept = 1/q_10, color = "blue") +  # mean waiting time 1 -> 0
  labs(x = "waiting times", y = "Exponential probability distribution") +
  theme_classic()

Transition matrix

Now that we have a good grasp of what transition rates are, we can define the next mathematical tool needed to specify a Markov chain for a discrete trait. This is the transition rate matrix, called the Q-matrix, shown here for two states

\[Q= \left( {\begin{array}{cc} - q_{01} & q_{01} \\ q_{10} & -q_{10} \\ \end{array} } \right)\]

Each row of the Q-matrix adds up to 0 (e.g., \(q_{10}+ (-q_{10})=0\)). The reason is that the Q-matrix is simply the derivative of the probability transition matrix \(P(t)\) that we defined above, and since each row of \(P(t)\) always sums to the constant 1, the derivatives of those rows must sum to 0.

\[P(t)=e^{Qt}\] \[\frac{d P(t)}{dt}= e^{Qt}Q=P(t)Q\]
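Both facts can be sketched numerically, assuming the illustrative rates from the tail example and the expm package introduced below: the rows of \(Q\) sum to 0, and a finite-difference approximation of \(dP(t)/dt\) matches \(P(t)Q\).

```r
library(expm)  # matrix exponential

Q <- matrix(c(-0.5, 0.03, 0.5, -0.03), nrow = 2)
rowSums(Q)  # both rows sum to 0

t <- 2; h <- 1e-6
dP <- (expm(Q * (t + h)) - expm(Q * t)) / h  # finite-difference derivative of P(t)
max(abs(dP - expm(Q * t) %*% Q))             # essentially zero
```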

Going from the Q-matrix back to the matrix of transition probabilities

Given the Q-matrix, it is sufficient to multiply it by time \(t\) and exponentiate it in order to obtain the transition probabilities. This step is hard because the exponential of a matrix has to be calculated by approximations. Mathematicians have dedicated a lot of time and effort to understanding matrix exponentials in order to calculate them numerically. For traits that have more than two states, giving an analytical solution is generally impossible, but in our small example the transition probability matrix looks like this

\[P(t)=\frac{1}{q_{01}+q_{10}} \left( {\begin{array}{cc} q_{10}+q_{01}e^{-(q_{01}+q_{10})t} & q_{01}-q_{01}e^{-(q_{01}+q_{10})t} \\ q_{10}-q_{10}e^{-(q_{01}+q_{10})t} & q_{01}+q_{10}e^{-(q_{01}+q_{10})t} \\ \end{array} } \right)\]

# Example of calculating the transition probabilities from transition rates
# for a model with two states; time is 1 million years.

prob.mat <- function(q_01, q_10, t){
  exp.val <- exp(-(q_01 + q_10) * t)
  probabilities <- 1/(q_01 + q_10) * matrix(c(q_10 + q_01*exp.val, q_10 - q_10*exp.val,
                                              q_01 - q_01*exp.val, q_01 + q_10*exp.val), nrow = 2)
  return(probabilities)
}

(P.mat<- prob.mat(q_01=0.5, q_10=0.03, t=1))
##            [,1]      [,2]
## [1,] 0.61189148 0.3881085
## [2,] 0.02328651 0.9767135

As you can see, this is already a very complicated set of equations; therefore, when defining models for large numbers of states and within software, always choose to define your Q-matrix instead of handling the P-matrix directly. In software, the exponential is calculated numerically as follows

#install.packages("expm")
library(expm)
(Q.mat<-matrix(c(-0.5,0.03,0.5,-0.03), nrow=2))
##       [,1]  [,2]
## [1,] -0.50  0.50
## [2,]  0.03 -0.03
(P.mat<-expm(Q.mat*1))
##            [,1]      [,2]
## [1,] 0.61189148 0.3881085
## [2,] 0.02328651 0.9767135
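As a sanity check, the analytical solution and the numerical matrix exponential agree. A self-contained sketch combining the two approaches above:

```r
library(expm)  # numerical matrix exponential

q_01 <- 0.5; q_10 <- 0.03; t <- 1

# Analytical two-state solution
exp.val <- exp(-(q_01 + q_10) * t)
P.analytic <- 1/(q_01 + q_10) *
  matrix(c(q_10 + q_01*exp.val, q_10 - q_10*exp.val,
           q_01 - q_01*exp.val, q_01 + q_10*exp.val), nrow = 2)

# Numerical matrix exponential of Q*t
P.numeric <- expm(matrix(c(-q_01, q_10, q_01, -q_10), nrow = 2) * t)

max(abs(P.analytic - P.numeric))  # effectively zero
```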

Useful reading

TBD