Introduction
- Introduction
- Basic Protocol 1: Estimating Phylogeny (Topology and Branch Lengths)
- Basic Protocol 2: Partitioned Data Analysis
- Basic Protocol 3: Model Comparison Using Bayes Factors
- Guidelines for Understanding Results
- Commentary
- Literature Cited
Phylogeny programs, in general, seek to estimate the evolutionary history of a study group from a set of sequences. Several different methodologies exist for estimating phylogenies, including distance-based methods, parsimony, maximum likelihood, and Bayesian inference. Distance-based methods and maximum parsimony estimate a phylogeny based on their respective optimality criteria, e.g., the smallest number of changes required to explain the observed sequence data. Maximum likelihood and Bayesian inference, on the other hand, use stochastic models of evolution to describe the observed data. Bayesian methods are attractive for their ability to directly quantify uncertainty in parameter estimates and because they remain efficient when applied to relatively complex models. However, Bayesian inference requires the specification of prior probabilities that are transformed by the likelihood function into posterior probabilities, which demands special considerations. For a more detailed comparison of different phylogeny inference methods, we refer the reader to more comprehensive reviews, such as Huelsenbeck, Larget, Miller, & Ronquist (2002); Holder & Lewis (2003); and Yang & Rannala (2012).
Phylogenetic inference in a Bayesian statistical framework aims to estimate the posterior probability distributions of phylogenetic parameters for a given study group. These parameters include the phylogenetic relationships among lineages, the amount of divergence between lineages, rates of substitution, rates of diversification, and many other measures of the tempo and mode of evolution (Huelsenbeck, Ronquist, Nielsen, & Bollback, 2001). Uncertainty in the estimates of these parameters is handled naturally by the Bayesian approach (Huelsenbeck et al., 2002; Holder & Lewis, 2003). The posterior distribution is approximated using numerical algorithms such as Markov chain Monte Carlo (MCMC) sampling (Yang & Rannala, 1997). The popularity of Bayesian phylogenetic methods can be attributed to the straightforward interpretation of the posterior probabilities, the ability to apply complex and mechanistic models of evolution, and their wide availability in a number of software programs, e.g., MrBayes (Ronquist, Teslenko, van der Mark, Ayres, Darling, Höhna, Larget, Liu, Suchard, & Huelsenbeck, 2012), BEAST (Drummond & Rambaut, 2007; Bouckaert, Heled, Kühnert, Vaughan, Wu, Xie, Suchard, Rambaut, & Drummond, 2014), and PhyloBayes (Lartillot, Lepage, & Blanquart, 2009).
The space of described phylogenetic models has expanded rapidly in recent years, giving rise to a wide range of models that vary in their complexity. Simple models of sequence evolution, for example, assume equal base frequencies and equal transition rates between DNA characters (Jukes & Cantor, 1969). Furthermore, simple models assume that all sites in a sequence evolve at the exact same evolutionary rate and via the same process (Yang, 1994). Complex models, however, aim to relax these assumptions by capturing known biological properties, like unequal evolutionary rates at different codon positions (Shapiro, Rambaut, & Drummond, 2006). To keep pace with model development, phylogenetic software has also increased in complexity. In a recent paper, we introduced a new approach for phylogenetic model representation—probabilistic graphical models (Höhna, Heath, Boussau, Landis, Ronquist, & Huelsenbeck, 2014)—to enable a flexible and expandable phylogenetic inference platform and a mechanism for visually representing hierarchical models of evolution. Within the probabilistic graphical model paradigm, a statistical model is represented visually as a graph of nodes and edges depicting probability distributions and parameter transformations (see Fig. 6.16.1). These model graphs lay bare the conditional dependence structure of the model. RevBayes is a new phylogenetic inference program that harnesses the power of probabilistic graphical models, allowing users to specify models of arbitrary complexity and estimate posterior probabilities in a Bayesian framework (Höhna, Landis, Heath, Boussau, Lartillot, Moore, Huelsenbeck, & Ronquist, 2016). To interface with the RevBayes core, we designed a new interactive model-specification language called Rev (Höhna et al., 2016), which instantiates a graphical model in computer memory.
Both graphical models and the interactive model specification language Rev are central parts of RevBayes and are similar in philosophy and intention to other probabilistic programming environments, e.g., WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000) and Stan (Carpenter, Gelman, Hoffman, Lee, Goodrich, Betancourt, Brubaker, Guo, Li, & Riddell, in press). The complexity of the language and modeling framework of RevBayes may present initial challenges to new users, but it is important to state that once the central concepts are appreciated and understood (this is facilitated by detailed user tutorials and documentation on http://revbayes.com), RevBayes provides a powerful platform for conducting fully integrative Bayesian phylogenetic inference. Analysis of biological data under complex models in a Bayesian framework will better capture statistical uncertainty in phylogenetic parameters and enable greater understanding of the processes driving evolution.
In this unit, we provide three related protocols for performing phylogenetic analyses in RevBayes. Basic Protocol 1 guides the user through the specification of a phylogenetic model of molecular sequence evolution (i.e., a substitution model and a tree topology including branch lengths), including all parameters and MCMC proposals. This procedure is followed by a description of how to apply the MCMC algorithm to approximate the posterior distribution of all model parameters. In Basic Protocol 2, we outline how to construct a model for a multi-locus dataset of protein-coding genes and partition one locus by codon position. Each partition is assumed to be generated from a different substitution model. This second basic protocol extends the core concepts of the first basic protocol to emphasize the generality and flexibility of RevBayes. Basic Protocol 3 gives the steps for comparing two substitution models—the one-parameter Jukes-Cantor model (Jukes & Cantor, 1969) and the five-parameter Hasegawa-Kishino-Yano model (Hasegawa, Kishino, & Yano, 1985)—by estimating the marginal likelihood under each model and comparing the support using Bayes factors.
Basic Protocol 1: Estimating Phylogeny (Topology and Branch Lengths)
Probabilistic graphical models are central to RevBayes. The probabilistic graphical model is instantiated in computer memory, variable by variable, by executing a sequence of commands in Rev. Rev is the programming language used by RevBayes. This distinction separates the design of the syntax (i.e., commands) from the implementation of the method. It is of utmost importance to think about a phylogenetic model as a probabilistic graphical model to understand which Rev commands are required to build the desired model.
When applying Bayesian methods, the choice of tree model (a prior on the topology and branch lengths) and the substitution model are central to accurate phylogenetic inference. The graphical model in Figure 6.16.1 represents the commonly used general-time reversible substitution model with among-site rate variation (GTR+Γ+I; Tavaré, 1986) with a uniform prior on tree topologies. Additionally, Figure 6.16.1 provides the corresponding Rev commands to specify the model parameters and structure. Below, we explain and motivate the steps to build this model and estimate the posterior probabilities of parameters using MCMC.
The MCMC algorithm in RevBayes consists of two main components called moves and monitors. Moves are specific tools that update one or several parameters of the model. For example, the sliding-window move updates a single parameter by adding a normally distributed value to the current value of the parameter (for a detailed explanation of common moves used in phylogenetics, see Yang, 2014), while the nearest-neighbor-interchange (NNI) move updates the tree topology by randomly switching neighboring nodes (Höhna, Defoin-Platel, & Drummond, 2008; Höhna & Drummond, 2012). The collection of moves allows the MCMC algorithm to fully explore parameter space. The second main component of the MCMC algorithm is the set of monitors, which simply define, among other things, the variables to be sampled (i.e., monitored), the format and destination of the output (i.e., stored in a file), and the frequency of sampling. The flexibility of generic monitors enables RevBayes users to store different variables, like the tree topology, in separate files for specific post-processing.
In this unit, we use RevBayes interactively, by executing every single line in the terminal. RevBayes can also run an analysis from a script file (a text file) by using the source("my_analysis.Rev") command. Running RevBayes from a script file is preferred, because runs are more easily reproduced, and analyses can run unattended. We provide the examples as scripts in the Supplementary Material and on our Web site: http://revbayes.com/tutorials.html.
Necessary Resources
Hardware
- Standard workstation (e.g., Macintosh, Windows, Linux, or Unix system) or computer cluster. In principle, RevBayes runs on any modern computer architecture. More powerful computers with many CPUs can be helpful for large datasets. The examples described in the protocol as well as small to intermediate datasets run sufficiently well on standard desktop computers.
Software
- RevBayes is a stand-alone software application that comes with the boost and NCL libraries included (Lewis, 2003). The source code is freely available from https://github.com/revbayes/revbayes and can be compiled on any modern platform using cmake and a C++ compiler (e.g., GCC 4.2 or newer). Additionally, pre-compiled versions of RevBayes for Windows 7 and Mac OS X 10.6 (or higher) are provided (https://revbayes.com). The analyses provided in this protocol were written for RevBayes version 1.0.2 (commit 67426a2). We recommend using RevBayes v1.0.2 or higher for any analyses based on the exercises described below.
Files
RevBayes recognizes any standard file format for molecular sequence data files with aligned DNA sequences, e.g., NEXUS, PHYLIP, and FASTA. All data files and analysis scripts are available for download from our Web site: http://revbayes.com/tutorials.html. For the analyses outlined in this protocol, we will estimate the phylogeny of 23 primate taxa using an alignment of cytochrome b sequences in the file labeled primates_cytb.nex.
Getting started
- 1.
In Unix systems, open a terminal window and type rb in the command line. You should make sure that the RevBayes executable is in your path variable so that you can start RevBayes from any directory. The directory from which you start RevBayes is important in order to use relative file paths for reading in data. On Windows systems, you can either double click the RevBayes executable or open the command-line window and type rb.exe.
- 2.
Load the data into your workspace:
data <- readDiscreteCharacterData("data/primates_cytb.nex")
RevBayes reads in standard NEXUS, FASTA, and Phylip formatted files [see unit 6.3 (Desper & Gascuel, 2006) and unit 6.4 (Wilgenbusch & Swofford, 2003) for descriptions of these file formats]. Here we read in the cytochrome b (cyt-b) sequence data from a file called primates_cytb.nex and store the data in a variable named data. The sequence alignment file is stored in the data directory, which should be within the directory from which you started RevBayes. Alternatively, you can provide the full path to read in a file from a different directory. If you want to know the current directory in which RevBayes is running, then you can query this information by typing getwd() into the console window.
- 3.
Query necessary information about the taxa from the data:
n_species <- data.ntaxa()
taxa <- data.taxa()
n_branches <- 2 * n_species - 3
In step 11 of this protocol, we need to know the number of branches in the tree so that we can create the desired number of branch length variables. We obtain the necessary information from the data variable created above using the member method .ntaxa() (which returns the number of taxa) and .taxa() (which returns a vector of taxa, i.e., the names of the species). To list the member methods a variable provides, use the member method .methods(), which is available for every variable, e.g., data.methods().
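The count of branch length variables follows directly from the tree shape: an unrooted, fully bifurcating tree with n tips has n terminal branches plus n - 3 internal branches, for 2n - 3 in total. As a quick sanity check (a Python illustration, not Rev code; the function name is ours):

```python
def n_branches_unrooted(n_taxa):
    # An unrooted, fully bifurcating tree with n tips has 2n - 3 branches:
    # n terminal branches plus n - 3 internal branches.
    return 2 * n_taxa - 3

print(n_branches_unrooted(23))  # 43 branch length variables for the 23 primates
```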
- 4.
Instantiate helper variables:
mvi = 0
mni = 0
In RevBayes, you must create the moves and monitors manually and store them in a vector. Moves are algorithms used to propose new parameter values during the MCMC simulation, and monitors are simple functions that print a subset of variables either to the screen or to a file [also see the section on Markov Chain Monte Carlo (MCMC) Simulation, below]. For convenience, we create two counter variables, initialized to zero, that tell us how many moves and monitors we have already created. These counter variables can then be used to add a new move or monitor at the end of the corresponding vectors.
Substitution model
- 5.
Create the stationary frequency parameters:
alpha2 <- v(1,1,1,1)
pi ~ dnDirichlet(alpha2)
The first parameter of our model is the vector of the stationary frequencies pi. Every parameter in a Bayesian analysis must have a prior distribution. Prior distributions are required because we are interested in the posterior distribution of the parameters, where the posterior distribution is obtained by calculating the product of the prior distribution and the likelihood function. The stationary frequencies are a vector of probabilities that sum to one. Such a vector is also called a ‘simplex’. The Dirichlet distribution is a natural choice for the prior distribution because the Dirichlet distribution assigns probability densities to a group of parameters that measure proportions and must sum to one. Here, we have specified a four-parameter Dirichlet prior, where each value describes one of the four stationary frequencies of the GTR model. Without strong prior knowledge about the pattern of stationary frequencies, however, we can better reflect our uncertainty by using a vague prior. Notably, all patterns of stationary frequencies have the same probability density under the flat Dirichlet prior specified by alpha2 <- v(1,1,1,1). Note that we could also fix the stationary frequencies by declaring pi as a constant variable: pi <- simplex(1,1,1,1).
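For intuition about the Dirichlet prior on the simplex, note that a Dirichlet draw can be constructed by normalizing independent Gamma variates. The following Python sketch (illustration only, not RevBayes code; the function name and seed are ours) shows the construction:

```python
import random

def dirichlet_sample(alpha, seed=1):
    # Draw independent Gamma(alpha_i, 1) variates and normalize them; the
    # normalized vector is a draw from Dirichlet(alpha) and sums to one,
    # i.e., it is a point on the simplex.
    rng = random.Random(seed)
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

pi = dirichlet_sample([1.0, 1.0, 1.0, 1.0])  # flat prior, as with alpha2 <- v(1,1,1,1)
print(pi, sum(pi))
```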
moves[++mvi] = mvBetaSimplex(pi, weight=2.0)
moves[++mvi] = mvDirichletSimplex(pi, weight=1.0)
We need to select moves that propose new values of the stationary frequencies during the MCMC simulation because the stationary frequencies of the GTR substitution model are parameters that we want to estimate, i.e., stochastic variables. Here we choose two moves: the mvBetaSimplex, which updates only a single element (one of the stationary frequencies), and the mvDirichletSimplex, which proposes new values drawn from a Dirichlet distribution centered around the current values. Both moves rescale the stationary frequencies after the proposal so that the values always sum to 1. The first move receives a weight of 2.0 and the second move a weight of 1.0, which implies that the first move will be applied on average 2.0 times per MCMC iteration and the second 1.0 times per iteration (also see the section on Markov Chain Monte Carlo (MCMC) below). We add each move to our vector of moves and automatically increment the counter variable mvi using the pre-increment operator ++mvi.
- 6.
Create the exchangeability rate parameters:
alpha1 <- v(1,1,1,1,1,1)
er ~ dnDirichlet(alpha1)
Next, we create a stochastic variable for the exchangeability parameters of the GTR substitution model. In RevBayes we have adopted the convention that exchangeability rates sum to 1, i.e., the exchangeability rates are normalized. This convention ensures that the model is identifiable and yields branch lengths in expected number of substitutions. Hence, we can apply a Dirichlet prior distribution to the exchangeability rates. Representing our lack of prior information about the exchangeability rates, we use a flat Dirichlet prior distribution: alpha1 <- v(1,1,1,1,1,1).
moves[++mvi] = mvBetaSimplex(er, weight=3.0)
moves[++mvi] = mvDirichletSimplex(er, weight=1.5)
We need to specify moves for the exchangeability rates as we did for the stationary frequencies. Since the exchangeability rates are also of type simplex and drawn from a Dirichlet distribution, we can use the same type of moves as before.
- 7.
Combine exchangeability rates and stationary frequencies into the substitution rate matrix:
Q := fnGTR(er, pi)
Given the exchangeability rates and stationary frequencies, we can deterministically compute the instantaneous substitution rate matrix. We show the relationship of the parameters and the transformation into the rate matrix in the section titled Phylogenetic Models and Theory in the Commentary, below. In RevBayes, we use the function fnGTR with the parameters er and pi to instantiate the deterministic variable Q. The dependencies of the substitution rate matrix Q on its parameters er and pi are shown using dashed arrows in Figure 6.16.1.
- 8.
Model among site rate variation (ASRV):
alpha_prior_mean <- ln(1.5)
alpha_prior_sd <- 0.587405
alpha ~ dnLognormal(alpha_prior_mean, alpha_prior_sd)
sr := fnDiscretizeGamma(shape=alpha, rate=alpha, numCats=4, median=FALSE)
To specify the discrete Gamma ASRV model, we need a deterministic node that is a vector of k rates drawn from a Gamma distribution with k rate categories. The fnDiscretizeGamma function returns this deterministic node and takes four arguments: the shape and rate of the Gamma distribution, the number of categories, and whether to use the mean or median of the quantiles. Since we want to discretize a mean-one Gamma distribution, we can pass in alpha for both the shape and rate. alpha itself is a stochastic variable drawn from a log normal prior distribution with location parameter ln(1.5) and scale of 0.587405. Thus, our resulting 95% prior interval ranges from a vector of site rates of [0.029, 0.235, 0.801, 2.936] to [0.491, 0.798, 1.084, 1.628], representing either a 100-fold difference in rates or a 3-fold difference in rates. This specific prior shows an example of a biologically motivated prior. Note that here, by convention, we set k = 4 using the argument numCats=4 (Yang, 1994). Particular to RevBayes is that the number of rate categories k and the distribution of the rate categories can be exchanged easily, for example, by using a log normal distribution with 6 rate categories instead (sr:= fnDiscretizeDistribution(dnLognormal(mean=0, sd=alpha), num_cats=6)).
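As a rough check on the prior scale described above, the central 95% interval of the lognormal prior on alpha itself can be computed directly. A Python illustration of the arithmetic (using the standard normal quantile 1.96; not Rev code):

```python
import math

mean_log = math.log(1.5)  # location parameter of the lognormal prior on alpha
sd_log = 0.587405         # scale parameter of the lognormal prior on alpha

# The central 95% of a lognormal lies between exp(mu - 1.96*sigma) and
# exp(mu + 1.96*sigma); the product of the two bounds equals exp(2*mu).
lower = math.exp(mean_log - 1.96 * sd_log)
upper = math.exp(mean_log + 1.96 * sd_log)
print(lower, upper)  # roughly 0.47 and 4.74, about a 10-fold spread in alpha
```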
moves[++mvi] = mvScale(alpha, lambda=1.0, weight=2.0)
The random variable that controls the rate variation is the stochastic node alpha. We apply a simple scale move to this parameter (for details also see Yang, 2014). The scale move proposes new parameter values by multiplying the current value by a factor e^(λu), where u is drawn uniformly from (−0.5, 0.5). Here, λ is a tuning parameter that controls the range of proposed scaling factors. For example, if λ = 1.0, then the current value is scaled by a factor between e^(−0.5) and e^(0.5), and if λ = 2.0, then the scaling factor is between e^(−1) and e^(1). Tuning parameters can be set manually or tuned automatically to achieve “optimal” performance (Haario, Saksman, & Tamminen, 1999; Roberts & Rosenthal, 2009).
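The bounds quoted above follow directly from the form of the proposal. A small Python illustration (the helper name is ours):

```python
import math

def scale_move_bounds(lam):
    # The scale move multiplies the current value by exp(lam * u) with
    # u ~ Uniform(-0.5, 0.5), so the proposed scaling factor always lies
    # between exp(-lam / 2) and exp(lam / 2).
    return math.exp(-lam / 2.0), math.exp(lam / 2.0)

print(scale_move_bounds(1.0))  # about (0.61, 1.65)
print(scale_move_bounds(2.0))  # about (0.37, 2.72)
```

Note that the bounds are reciprocals of each other, so the proposal shrinks and grows values symmetrically on the log scale.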
- 9.
Model probability of invariable sites:
p_inv ~ dnBeta(1.0, 1.0)
In addition to the parameters for rate variation among sites, we add a stochastic variable for the probability of a site being invariant, p_inv. Since p_inv is a probability (any number between 0 and 1), we choose a Beta(1,1) distribution as the prior. This flat Beta distribution gives equal probability density to every value between 0 and 1.
moves[++mvi] = mvBetaProbability(p_inv, weight=2.0)
For the p_inv variable, we apply a Beta-Probability move which is applicable to stochastic variables of type Probability and uses a Beta distribution to propose new values. We could also have used (additionally) a scaling move (mvScale) or sliding move (mvSlide) as for any other continuous variable (Yang, 2014).
Tree topology and branch lengths
In the previous steps, we specified the substitution model with its parameters. The substitution model is necessary to compute the likelihood of a phylogeny given the data. Now, we specify our main variable of interest: the phylogeny. The phylogeny variable comprises the tree topology and the branch lengths. We will first specify the topology and branch length variables independently, then merge the parameters together to create a phylogeny variable.
- 10.
Specify a uniform prior distribution on the tree topology:
out_group = clade("Galeopterus_variegatus")
topology ~ dnUniformTopology(taxa, outgroup=out_group)
In this protocol, we will estimate an unrooted phylogeny. We first specify a uniform distribution on all topologies that have this set of taxa and our outgroup. Hence, every topology has the same prior probability of one over the number of distinct topologies (for computing the number of topologies, see Felsenstein, 1978). Note that the prior probability can become very small because the number of topologies grows super-exponentially with the number of taxa, but this is no concern for Bayesian phylogenetics because all topologies are equally probable a priori. For this dataset, the outgroup is a single species: the flying lemur (Galeopterus variegatus). However, RevBayes allows you to specify any number of outgroup species. Other software, such as MrBayes, implicitly assumes that the first entry in the data file is the outgroup species. In RevBayes, we require users to select the outgroup manually to make all decisions conscious and visible. Since the trees sampled by MCMC are actually all unrooted, they can also be re-rooted after the analysis even if no outgroup is specified (see unit 6.1; Page, 2003).
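To see just how small these uniform prior probabilities become, the number of unrooted, fully bifurcating topologies can be computed as the double factorial (2n − 5)!!. A Python illustration (the function name is ours):

```python
def n_unrooted_topologies(n_taxa):
    # The number of distinct unrooted, fully bifurcating topologies for
    # n >= 3 labeled tips is the double factorial
    # (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5).
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

print(n_unrooted_topologies(4))   # 3
print(n_unrooted_topologies(23))  # on the order of 10**25, so the uniform
                                  # prior probability of each topology is tiny
```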
moves[++mvi] = mvNNI(topology, weight=5.0)
moves[++mvi] = mvSPR(topology, weight=3.0)
We apply two moves on the tree topology: a nearest neighbor interchange move (mvNNI) and a subtree-prune-regraft move (mvSPR). These two moves change only the tree topology. The tree topology is often the most difficult parameter to estimate (mix over). Therefore, more specialized moves that propose new topologies (and branch lengths) using more sophisticated methods are available in RevBayes. For a discussion, see Höhna et al. (2008); Lakner, van der Mark, Huelsenbeck, Larget, & Ronquist (2008); Höhna & Drummond (2012); Yang (2014).
- 11.
Specify an exponential branch length prior:
-
for (i in 1:n_branches) {
-
bl[i] ∼ dnExponential(10.0)
-
moves[++mvi] = mvScale(bl[i])
-
-
Every branch length is represented in this model as its own independent and identically distributed stochastic variable. This is achieved by using a for loop. The for loop corresponds to the plate, drawn as a dashed box around the bl variable in Figure 6.16.1 (also see Fig. 6.16.2). In the for loop, we create each branch length variable drawn from an exponential distribution with rate 10. Remember that an exponential distribution with rate 10 has an expectation of one divided by the rate (1/10). Additionally, we compute the total tree length by summing all branch lengths, which is done in the deterministic variable TL. The tree length is not a parameter of the model itself, but we might want to monitor it to, for example, learn about the posterior distribution of the tree length instead of only the individual branch lengths. Even though the exponential branch length prior is the most common choice, it has been shown to bias tree inference (e.g., Yang & Rannala, 2005; Brown, Hedtke, Lemmon, & Lemmon, 2010; Rannala, Zhu, & Yang, 2012). Alternatively, we could have specified a prior distribution on the tree length, such as a Gamma prior, and a Dirichlet distribution on the relative branch lengths (Zhang, Rannala, & Yang, 2012). Each branch length is then the product of its relative branch length and the tree length.
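The prior expectations implied by this choice are easy to compute. A Python illustration of the arithmetic (values taken from the 23-taxon example; not Rev code):

```python
rate = 10.0      # rate of the exponential branch length prior
n_branches = 43  # 2 * 23 - 3 branches for the 23-taxon primate tree

# An Exponential(rate) distribution has mean 1 / rate, so each branch is
# expected a priori to be 0.1 substitutions per site, and the prior mean
# of the tree length TL is the sum of the per-branch expectations.
expected_bl = 1.0 / rate
expected_tl = n_branches * expected_bl
print(expected_bl, expected_tl)  # roughly 0.1 and 4.3
```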
psi := treeAssembly(topology, bl)
Finally, we combine the tree topology variable (topology) and branch lengths (bl) into a tree variable with branch lengths. The separation between tree topology and branch lengths allows us more flexibility in specifying prior distributions on each.
Putting it all together
- 12.
Model character evolution along the phylogeny:
seq ~ dnPhyloCTMC(tree=psi, Q=Q, siteRates=sr, pInv=p_inv, type="DNA")
We have fully specified all of the parameters of our phylogenetic model—the tree topology with branch lengths, and the substitution model that describes how the sequence data evolved over the tree with branch lengths. Collectively, these parameters comprise a distribution called the ‘phylogenetic continuous-time Markov chain’, and we use the dnPhyloCTMC distribution to create a stochastic variable seq for the sequence data. This distribution requires several input arguments (arguments marked with * are optional): (1) the tree with branch lengths psi; (2) the instantaneous-rate matrix Q; (*3) the rate categories for the site-specific rates sr; (*4) the probability of a site being invariant p_inv; (*5) the clock rate that scales all branch lengths, which is not part of this model; and (6) the type of character data “DNA”.
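The phylogenetic CTMC evaluates transition probabilities along every branch from the rate matrix. For the simplest member of the GTR family, the Jukes-Cantor model (discussed in the Introduction and in Basic Protocol 3), these probabilities have a closed form, which the following Python sketch (illustration only, not the RevBayes implementation; the function name is ours) computes:

```python
import math

def jc69_transition_prob(i, j, t):
    # Probability that a site in state i ends a branch of length t (in
    # expected substitutions per site) in state j, under the Jukes-Cantor
    # model with four equally frequent states.
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

# A branch of length zero preserves the state; a very long branch decays
# toward the stationary frequency 1/4 for every state.
print(jc69_transition_prob(0, 0, 0.0))
print(jc69_transition_prob(0, 1, 100.0))
```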
- 13.
Attach the data to the sequence variable:
seq.clamp(data)
Although we assume that our sequence data are random variables—they are realizations of our phylogenetic model—for the purposes of inference, the observed sequence data are ‘clamped’ to the seq variable. When the .clamp() member method is called, RevBayes sets each of the stochastic nodes representing the tips of the tree to the corresponding nucleotide sequence in the alignment. This essentially tells the program that we have observed data for the sequences at the tips.
- 14.
Instantiate a model object:
mymodel = model(psi)
Finally, we wrap the entire model to provide convenient access to the model graph. To do this, we only need to give the model function a single node; here we use psi, although any variable in the model graph would work. With this node, the model function can find all of the other nodes by following the arrows in the graphical model. The variable mymodel now holds its own independent copy of the entire model graph in computer memory.
Performing an MCMC analysis
- 15.
Add monitors to store samples from the MCMC simulation into files:
monitors[++mni] = mnModel(filename="output/primates_cytb.log", printgen=10, separator=TAB)
monitors[++mni] = mnFile(filename="output/primates_cytb.trees", printgen=10, separator=TAB, psi)
monitors[++mni] = mnScreen(printgen=1000, TL)
For our MCMC analysis, we need to set up a vector of monitors to record the states of our Markov chain. The monitor functions are all called mn*, where * is the wildcard representing the monitor type. First, we will initialize the model monitor using the mnModel function. This creates a new monitor object that will output the states for all simple, numeric model parameters (i.e., not the tree and the rate matrix) when passed into a MCMC function. The mnFile monitor will record the states for only the parameters passed in as arguments. We use this monitor to specify the output for our sampled trees and branch lengths: psi. Finally, the screen monitor will report the states of specified variables to the screen with mnScreen. The screen monitor is just for our convenience, to see what is happening. All monitors have an argument called printgen that specifies how frequently samples are stored (i.e., thinning of the samples). Thinning is important to reduce file size and maximize the ratio of effective sample size (ESS) to the number of samples taken.
- 16.
mymcmc = mcmc(mymodel, monitors, moves)
mymcmc.burnin(generations=10000, tuningInterval=200)
mymcmc.run(generations=30000)
With a fully specified model, a set of monitors, and a set of moves, we can now set up the MCMC algorithm that will sample parameter values in proportion to their posterior probability. The mcmc function will create our MCMC object. We may wish to run the .burnin() member function. Note that this function does not specify the number of states that we wish to discard from the MCMC analysis as burnin (i.e., the samples collected before the chain converges to the stationary distribution). Instead, the .burnin() function specifies a completely separate preliminary MCMC simulation that is used to auto-tune the moves to improve mixing of the MCMC analysis. Additionally, the .burnin() function will move the parameters towards the stationary distribution, and thus fewer, if any, samples from the actual MCMC simulation have to be discarded. When the analysis is complete, you will have the monitored files in your output directory. See the section titled Guidelines for Understanding Results for information on evaluating and summarizing MCMC output.
Basic Protocol 2: Partitioned Data Analysis
This is an introduction to partitioned phylogenetic analysis. Partitioned analyses allow different sets of homologous sites to evolve according to different sets of evolutionary parameters; for example, two genes with different functions may face different selection pressures and thus evolve according to different processes. Within a single protein-coding gene, the third codon position is expected to have a relatively high rate of substitution when compared with first and second codon positions, owing to the structure of the genetic code (Bull, Huelsenbeck, Cunningham, Swofford, & Waddell, 1993; Brandley, Schmitz, & Reeder, 2005; Brown & Lemmon, 2007).
This protocol describes how to perform a partitioned data analysis using RevBayes. A key idea underlying this section is plate notation (Fig. 6.16.3). In a graphical model, a plate represents a set of random variables and their dependencies that are replicated with identical structure; it is visually represented as a dashed rectangle encompassing the replicated variable nodes. This has a natural correspondence to the for loop in programming languages, and in both cases the replication keeps the model concise and easy to modify. In practice, instantiating the branch length variables within a for loop in Basic Protocol 1 created just such a plate of branch length variables. Plates are common structures in phylogenetic models, but here we focus on their application to multi-locus partitioned analyses.
In this example, we will consider a partition with four subsets of characters: first and second codon positions for cyt-b, third codon positions for cyt-b, first and second codon positions for COX2 (cytochrome c oxidase subunit 2), and third codon positions for COX2. Each character subset will evolve under its own substitution process, each similar to the single-locus model in Basic Protocol 1. Unlike the substitution process parameters, the phylogeny parameter is shared by all character subsets in the partition.
Necessary Resources
- All of the necessary resources for this protocol are described above in Basic Protocol 1. All data files and analysis scripts are available for download from our Web site, http://revbayes.com/tutorials.html. For this protocol, we will use a second dataset in addition to the cytochrome b alignment analyzed in Basic Protocol 1. This dataset contains 23 primate sequences for the gene cytochrome oxidase II in the file called primates_cox2.nex.
Getting started
- 1.
On Unix systems, open a terminal window and type rb on the command line. On Windows systems, you can either double-click the RevBayes executable or open a command-line window and type rb.exe.
- 2.
Load the two alignments from file:
data_cytb <- readDiscreteCharacterData("data/primates_cytb.nex")
data_cox2 <- readDiscreteCharacterData("data/primates_cox2.nex")
We read in the sequence data from the files primates_cytb.nex and primates_cox2.nex and store them in the variables named data_cytb and data_cox2. See step 2 of Basic Protocol 1 for more information about these commands.
- 3.
Divide the data into partitions:
First, we store two copies of each gene into a vector. Elements data[1] and data[2] correspond to cyt-b, while data[3] and data[4] correspond to cox2.
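As a sketch (assuming the variable names used in the surrounding steps), the copies can be created with simple constant-node assignments:

```
data[1] <- data_cytb
data[2] <- data_cytb
data[3] <- data_cox2
data[4] <- data_cox2
```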
data[1].setCodonPartition(v(1,2))
data[2].setCodonPartition(3)
data[3].setCodonPartition(v(1,2))
data[4].setCodonPartition(3)
Then, we assign partitions to differentiate third codon positions from first and second codon positions for cyt-b and COX2.
- 4.
Record the dimensions of the dataset:
n_species <- data_cytb.ntaxa()
n_branches <- 2 * n_species - 3
n_data_subsets <- data.size()
In the latter part of this protocol, we need to know the number of branches in the tree and the number of data subsets in the partition to design the structure of the model. See step 3 of Basic Protocol 1 for more information about these commands.
- 5.
Instantiate helper variables:
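As a sketch, mirroring step 4 of Basic Protocol 1, both counters start at zero:

```
mvi = 0
mni = 0
```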
In RevBayes you create the moves and monitors manually and store them in a vector. For convenience, we will create two counter variables that tell us how many moves and monitors we have already created. These counter variables can then be used to add a new move or monitor at the end of the corresponding vector. See step 4 of Basic Protocol 1 for more information about these commands.
Substitution model
- 6.
Declare a for loop over data subsets:
for (i in 1:n_data_subsets)
We use a for loop to assign individual substitution model parameters to each data subset in the partition. The contained code will be executed once for each of the four data subsets. From the graphical-modeling perspective, the for loop behaves like the plate representation, where each data subset is drawn from a common model structure. Each iteration of the loop essentially creates an independent copy of the model defined in Basic Protocol 1.
- 7.
Begin defining the code block for the for loop to execute:
All code contained between the open curly brace ({) and the matching closed curly brace (}) will be executed once for each value of i between 1 and n_data_subsets (this for loop is closed in step 13, below). Here, we will create stationary frequencies, exchangeability rates, rate matrices, among-site rate variation multipliers, and invariable-site parameters for each of the four character subsets in our partition. Building upon Basic Protocol 1, we are now specifying the model with vectors of parameters, where each element in the vector is associated with a block of characters. For example, er will no longer correspond to the simplex of exchangeability rates for the entire analysis, but rather to a vector of four simplices of exchangeability rates, accessed as er[1], er[2], er[3], and er[4]. The following code block is indented to emphasize that it will be executed within a loop.
- 8.
Create the ith stationary frequency parameters:
pi[i] ∼ dnDirichlet(alpha=[1,1,1,1])
moves[++mvi] = mvBetaSimplex(pi[i], weight=2.0)
moves[++mvi] = mvDirichletSimplex(pi[i], weight=2.0)
Each data subset has its own set of stationary frequencies, each of which has a flat Dirichlet prior distribution. Two MCMC moves are assigned to update the parameter. mvBetaSimplex updates one simplex value at a time, whereas mvDirichletSimplex updates all simplex values simultaneously. See step 5 of Basic Protocol 1 for more information about these commands.
- 9.
Create the ith exchangeability rate parameters:
er[i] ∼ dnDirichlet(alpha=[1,1,1,1,1,1])
moves[++mvi] = mvBetaSimplex(er[i], weight=3.0)
moves[++mvi] = mvDirichletSimplex(er[i], weight=1.5)
Each data subset has its own set of exchangeability rates, each of which has a flat Dirichlet prior distribution. See step 6 of Basic Protocol 1 for more information about these commands.
- 10.
Create the ith rate matrix from the stationary frequencies and exchangeability rate parameters:
Q[i] := fnGTR(er[i], pi[i])
Each character subset evolves according to its own GTR rate matrix. Each rate matrix is a function of exchangeability rates and stationary frequencies associated with that particular character subset in the partition. If the value of er[1] or pi[1] changes, it will only cause the value of Q[1] to change; the values of Q[2], Q[3], and Q[4] will remain the same because they are not child nodes of er[1] or pi[1]. See step 7 of Basic Protocol 1 for more information about these commands.
- 11.
Create the ith discrete Gamma distribution to model among-site rate variation (ASRV):
alpha[i] ∼ dnLognormal(mean=ln(1.5), sd=0.587405)
sr[i] := fnDiscretizeGamma(alpha[i], alpha[i], 4, false)
moves[++mvi] = mvScale(alpha[i], lambda=1.0, weight=2.0)
We create a discrete Gamma distribution to model among-site rate variation, which takes alpha[i] as both its shape and rate parameter. alpha[i] is log-normally distributed; unlike Basic Protocol 1, we set the mean and sd parameters directly rather than first creating two constant nodes and passing them as arguments. See step 8 of Basic Protocol 1 for more information about these commands.
- 12.
Create the ith invariable sites parameter:
p_inv[i] ∼ dnBeta(alpha=1, beta=1)
moves[++mvi] = mvBetaProbability(p_inv[i], weight=2.0)
The proportion of invariable sites is free to vary across subsets in the partition. Each parameter, p_inv[i], has a flat distribution, dnBeta(alpha=1,beta=1), and an MCMC move to sample posterior values. See step 9 of Basic Protocol 1 for more information about these commands.
- 13.
Complete the definition of the for loop code block:
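As a sketch, the block opened in step 7 is completed with a single closing brace:

```
}
```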
This completes the for loop that was initialized in step 6 above. The contents of the code block will be executed once for each of the four character subsets.
Tree topology and branch lengths
- 14.
Create the tree topology variable:
out_group = clade("Galeopterus_variegatus")
topology ∼ dnUniformTopology(taxa, outgroup=out_group)
The model will assume that all sites in the mitochondrial genome share a single gene tree. First, we assign a uniform prior distribution over all possible topologies describing the evolutionary relationships among the taxa, with the flying lemur (Galeopterus variegatus) set as the outgroup.
moves[++mvi] = mvNNI(topology, weight=5.0)
moves[++mvi] = mvSPR(topology, weight=3.0)
Then, we add two moves to inform MCMC how to explore the space of possible tree topologies: nearest neighbor interchange (mvNNI) and subtree-prune-regraft (mvSPR). See step 10 of Basic Protocol 1 for more information about these commands.
- 15.
Create branch length parameters for the tree:
for (i in 1:n_branches) {
bl[i] ∼ dnExponential(10.0)
moves[++mvi] = mvScale(bl[i])
}
sum_br_lens := sum(bl)
Next, we assign a prior distribution over the expected number of substitutions per site per branch. For each branch, bl[i], we create an exponentially distributed stochastic node and a scale move (mvScale) to enable the MCMC to mix over the posterior distribution of branch lengths. We also monitor the tree length by creating a deterministic node, sum_br_lens, whose value always equals sum(bl). See step 11 of Basic Protocol 1 for more information about these commands.
- 16.
Assemble the tree:
psi := treeAssembly(topology, bl)
We instantiate the non-clock tree, psi, whose value is determined by the function treeAssembly, which maps the vector of branch lengths, bl, onto the topology variable, topology.
- 17.
Per-subset scaling factor:
for (i in 1:n_data_subsets) {
if (i == 1) {
part_rates[i] <- 1.0
} else {
part_rates[i] ∼ dnGamma(2,2)
moves[++mvi] = mvScale(part_rates[i])
}
TL[i] := sum_br_lens * part_rates[i]
}
We assume that each subset of characters evolves according to its own substitution process, and thus its own substitution rate. The relative difference in rates can be treated as a multiplicative factor. We choose the first subset to have a multiplicative factor of 1 and let the remaining subsets evolve at rates relative to the first subset. If we choose the prior distribution for the remaining factors to have a mean of one, the expected prior distribution over relative rates will favor equal rates across the partition. If the data support differential substitution rates, however, we expect the values of part_rates to deviate from 1.
To accomplish this, we construct a for loop over the four subsets and assign the constant rate multiplier of 1.0 to the first element and the stochastic rate multiplier dnGamma(2,2) to all other elements. The value part_rates[i] will later be passed into the phylogenetic substitution process, dnPhyloCTMC, via the branchRates argument.
Putting it all together
- 18.
Model character evolution along the phylogeny:
for (i in 1:n_data_subsets) {
seq[i] ∼ dnPhyloCTMC(tree=psi, Q=Q[i], branchRates=part_rates[i], siteRates=sr[i], pInv=p_inv[i], type="DNA")
seq[i].clamp(data[i])
}
Each subset of data in the partitioned analysis evolves according to an independent phylogenetic substitution process. When declaring the relationship between the sequence data and their underlying distribution, it is important to recall the model assumptions for this exercise. All data subsets have independent substitution process parameters but share a common phylogeny (topology and branch lengths). Note that tree=psi is the only parameter that does not correspond to an element in a vector (e.g., Q[i], p_inv[i]). After creating each seq[i] variable, we want to condition on the partitioned multiple sequence alignments we read in earlier in the protocol. Just as with the single-locus analysis in Basic Protocol 1, we call seq[i].clamp(data[i]) to inform the model that seq[i] has the observed outcome data[i] of the evolutionary process defined by dnPhyloCTMC(..) in the previous line. See steps 12 and 13 of Basic Protocol 1 for more information about these commands.
- 19.
Instantiate a model object:
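As a sketch, mirroring step 14 of Basic Protocol 1 (any variable in the graph can be passed to the model function, which then traverses the whole graph):

```
mymodel = model(topology)
```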
The full graph of the model parameters is now specified. Calling the model function wraps all the variables in the graph and provides an interface between the graphical model and analysis objects, such as Mcmc. See step 14 of Basic Protocol 1 for more information about these commands.
Performing an MCMC analysis
- 20.
Add monitors to store samples from the MCMC simulation into files:
monitors[++mni] = mnModel(filename="output/primates_partition.log", printgen=10, separator = TAB)
monitors[++mni] = mnFile(filename="output/primates_partition.trees", printgen=10, separator = TAB, psi)
monitors[++mni] = mnScreen(printgen=1000, TL)
We create three monitors: a model monitor to record the sampled parameter values to file, a file monitor to record the sampled phylogenies to file, and a screen monitor to report the tree length values to the screen. See step 15 of Basic Protocol 1 for more information about these commands.
- 21.
Run an MCMC simulation:
mymcmc = mcmc(mymodel, monitors, moves)
mymcmc.burnin(generations=10000, tuningInterval=200)
mymcmc.run(generations=30000)
Calling mcmc(mymodel, monitors, moves) creates the Mcmc analysis object. During burn-in, we tune the efficiency of the MCMC proposals found in moves for 10000 generations but do not record the MCMC state. After burn-in, we run the MCMC for 30000 generations, updating the state according to the tuned moves vector and recording the state according to the monitors vector. See step 16 of Basic Protocol 1 for more information about these commands.
Basic Protocol 3: Model Comparison Using Bayes Factors
For most datasets of molecular sequence alignments, several (possibly many) substitution models of varying complexity are plausible a priori. As a result, we need an objective way to compare different models and quantify the evidence in favor of each one so that we may choose the best model for our data. Choosing the wrong model can have a severe impact on the inferred phylogenetic tree (Posada & Crandall, 2001). In Bayesian statistics, model selection is based on Bayes factors (Jeffreys, 1961; Kass & Raftery, 1995), which provide a method for hypothesis testing and for evaluating the support for a given model (for more detailed information on Bayes factors, please see the Bayesian Model Selection section in the Commentary below).
This protocol will describe the steps for comparing two models in RevBayes. Specifically, we will investigate the evidence in favor of a Jukes-Cantor (Jukes & Cantor, 1969) substitution model relative to the evidence supporting the Hasegawa-Kishino-Yano (Hasegawa et al., 1985) model of sequence evolution for the primate cytochrome b alignment. Computing the Bayes factor requires that one first calculate the marginal likelihood of each candidate model. We demonstrate two approaches to estimating marginal likelihoods that have been applied in phylogenetics: stepping-stone sampling and path sampling (Lartillot and Philippe, 2006; Fan, Wu, Chen, Kuo, & Lewis, 2011; Xie, Lewis, Fan, Kuo, & Chen, 2011; Baele, Li, Drummond, Suchard, & Lemey, 2013).
Both stepping-stone sampling and path sampling rely on power posteriors to compute the marginal likelihood of a model (Baele et al., 2013). Power posterior analyses are similar to standard MCMC analyses of the posterior distribution, except that the likelihood is raised to a power β, so that the sampled distribution ranges from the prior (β = 0) to the posterior (β = 1). All other components of the model and the MCMC algorithm remain unchanged. In practice, one runs an MCMC simulation for many values of β in the interval [0,1], commonly between 30 and 200 different values. Each analysis is a stepping stone, or element of a path, from the prior to the posterior. Finally, the marginal likelihood is computed by the stepping-stone or path-sampling formula; the two are different estimators of the same marginal likelihood computed from the same power posterior analyses.
Note that in this third protocol, we use simplified models of molecular evolution (e.g., we removed the ASRV component of the model) to avoid complexity and to demonstrate the modularity of the graphical-model framework. The key aim of this protocol is to show a simple, flexible, and generic approach to estimating marginal likelihoods and selecting among any set of models in RevBayes.
Getting started
- 2.
Load the sequence data from file:
data <- readDiscreteCharacterData("data/primates_cytb.nex")
This protocol will describe a simple procedure for comparing the substitution model for one dataset. Therefore, only one alignment is loaded (see step 2 of Basic Protocol 1 for more information about this command).
- 3.
Create the dataset-dimension and helper variables:
n_species <- data.ntaxa()
n_branches <- 2 * n_species - 3
n_data_subsets <- data.size()
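If RevBayes has been restarted since Basic Protocol 1, the move and monitor counters also need to be re-created; a sketch, mirroring step 4 of Basic Protocol 1:

```
mvi = 0
mni = 0
```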
See steps 3 and 4 of Basic Protocol 1 for more information about these commands.
The Jukes-Cantor model
- 4.
Create the constant node representing the rate matrix under the Jukes-Cantor (JC) substitution model (Jukes and Cantor, 1969):
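As a sketch, with the four nucleotide states passed as the argument:

```
Q <- fnJC(4)
```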
The fnJC function creates a Q-matrix where the rate of change between every state is equal. This function takes the number of states, e.g., 4 for nucleotides, as an argument. Importantly, one can use this function to create a rate matrix for characters with k states and equal rates of change between all states. Note that because the rates of change between states are fixed (all equaling 1), this makes the Q-matrix a constant node and no moves are defined for the parameters of the Jukes-Cantor model.
Tree topology and branch lengths
- 5.
Specify the prior distribution on the tree topology:
out_group = clade("Galeopterus_variegatus")
topology ∼ dnUniformTopology(taxa, outgroup=out_group)
moves[++mvi] = mvNNI(topology, weight=5.0)
moves[++mvi] = mvSPR(topology, weight=3.0)
- 6.
Define the branch length priors and assemble the tree:
for (i in 1:n_branches) {
bl[i] ∼ dnExponential(10.0)
moves[++mvi] = mvScale(bl[i])
}
psi := treeAssembly(topology, bl)
Putting it all together
- 7.
Specify the model of character evolution along the phylogeny and attach the observed sequence data:
seq ∼ dnPhyloCTMC(tree=psi, Q=Q, type="DNA")
seq.clamp(data)
See steps 12 and 13 of Basic Protocol 1 for more information about these commands.
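- 8.
Instantiate the model object:
The power-posterior analysis in step 10 requires a model object. As a sketch, mirroring step 14 of Basic Protocol 1, any variable in the graph can be passed to the model function:

```
mymodel = model(topology)
```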
Compute power posterior distributions
- 9.
Add the monitors to write MCMC samples to file and to the screen:
monitors[++mni] = mnModel(filename="output/primates_cytb_JC.log", printgen=10, separator=TAB)
monitors[++mni] = mnFile(filename="output/primates_cytb_JC.trees", printgen=10, separator=TAB, psi)
monitors[++mni] = mnScreen(printgen=1000, TL)
This creates files to store the MCMC samples of the model parameters. However, it is important to note that the sampler (specified below) for this analysis is not the same as in the first two protocols. Because we are sampling from power posteriors, the MCMC samples are not valid samples from the true target distribution. Thus, the samples in these files are best for troubleshooting the power-posterior run. See step 15 of Basic Protocol 1 for more information about these commands.
- 10.
Run MCMC under a series of power posteriors:
mypowerp = powerPosterior(mymodel, moves, monitors, "output/powerp_JC.out", cats=127, sampleFreq=10)
mypowerp.burnin(generations=10000, tuningInterval=200)
mypowerp.run(generations=10000)
To estimate the marginal likelihood of a given model, we must first run the MCMC under a series of power posteriors. This essentially raises the likelihood to a power between 1 and 0 in an iterative manner. The method computes a vector of powers from a beta distribution, then executes an MCMC run for each power while raising the likelihood to that power. In this implementation, the vector of powers starts at 1, sampling close to the posterior, and incrementally samples closer and closer to the prior as the power decreases. With the power-posterior samples saved to file, we can use stepping-stone sampling (Xie et al., 2011) or path sampling (Lartillot & Philippe, 2006) to estimate the marginal likelihood under this model (see steps 22 to 25, below).
Evaluate a second model
- 11.
Clear the workspace of the previously defined model (JC model):
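As a sketch, the Rev workspace can be emptied with the clear function:

```
clear()
```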
Unless RevBayes has been restarted, the workspace must be cleared of the previous model.
- 12.
Re-load the data from file, and instantiate the helper variables:
data <- readDiscreteCharacterData("data/primates_cytb.nex")
n_species <- data.ntaxa()
n_branches <- 2 * n_species - 3
n_data_subsets <- data.size()
mvi = 0
mni = 0
The Hasegawa-Kishino-Yano (HKY) model
- 13.
Specify a flat Dirichlet prior on the stationary frequencies:
sf_prior <- v(1,1,1,1)
sf ∼ dnDirichlet(sf_prior)
moves[++mvi] = mvBetaSimplex(sf, weight=3)
Like the GTR model specified in step 5 of Basic Protocol 1, the HKY model allows the base frequencies to differ from one another.
- 14.
Specify a log-normal prior on the transition-transversion rate ratio:
kappa ∼ dnLognormal(0, 1)
moves[++mvi] = mvScale(kappa, weight=3)
- 15.
Create a deterministic variable for the instantaneous rate matrix:
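As a sketch, assuming the kappa and sf variables defined in steps 13 and 14:

```
Q := fnHKY(kappa, sf)
```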
Similar to the specification of the GTR model in step 7 of Basic Protocol 1, the instantaneous rate matrix of the HKY model is a deterministic node in the graphical model. This node is created by the fnHKY function from the transition-transversion rate ratio and the base frequencies.
Tree topology and branch lengths
- 16.
Specify the uniform prior distribution on the tree topology:
out_group = clade("Galeopterus_variegatus")
topology ∼ dnUniformTopology(taxa, outgroup=out_group)
moves[++mvi] = mvNNI(topology, weight=5.0)
moves[++mvi] = mvSPR(topology, weight=3.0)
- 17.
Set up the stochastic nodes representing branch lengths and assemble the tree in a deterministic node:
for (i in 1:n_branches) {
bl[i] ∼ dnExponential(10.0)
moves[++mvi] = mvScale(bl[i])
}
psi := treeAssembly(topology, bl)
Putting it all together
- 18.
Specify the model of character evolution along the phylogeny and attach the observed sequence data:
seq ∼ dnPhyloCTMC(tree=psi, Q=Q, type="DNA")
seq.clamp(data)
See steps 12 and 13 of Basic Protocol 1 for more information about these commands.
- 19.
Instantiate the model object:
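As a sketch, mirroring step 14 of Basic Protocol 1, the model object can be created from any variable in the graph:

```
mymodel = model(topology)
```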
Compute power posterior distributions
- 20.
Create a vector of monitors that output the MCMC samples to file and to screen:
monitors[++mni] = mnModel(filename="output/primates_cytb_HKY.log", printgen=10, separator=TAB)
monitors[++mni] = mnFile(filename="output/primates_cytb_HKY.trees", printgen=10, separator=TAB, psi)
monitors[++mni] = mnScreen(printgen=1000, TL)
See step 9, above (in this protocol), for more information about these commands.
- 21.
Run MCMC under a series of power posteriors:
mypowerp = powerPosterior(mymodel, moves, monitors, "output/powerp_HKY.out", cats=127, sampleFreq=10)
mypowerp.burnin(generations=10000, tuningInterval=200)
mypowerp.run(generations=10000)
See step 10, above (in this protocol), for more information about these commands.
Estimate marginal likelihoods under the Jukes-Cantor model
- 22.
Use stepping-stone sampling to calculate marginal likelihoods from the output of the powerPosterior function:
ss_JC = steppingStoneSampler(file="output/powerp_JC.out", powerColumnName="power", likelihoodColumnName="likelihood")
The steppingStoneSampler function reads the output file produced by the powerPosterior function and computes the marginal likelihood using stepping-stone sampling. The command above assigns the sampler to a variable called ss_JC and reads in the power-posterior file saved under the JC model.
- 23.
Assign the stepping-stone estimate of the marginal likelihood to a variable in the workspace:
ssmlnl_JC = ss_JC.marginal()
A stepping-stone sampler object has a member function called marginal that returns the marginal likelihood computed from the power-posterior using the stepping-stone approach (Fan et al., 2011; Xie et al., 2011). In the Rev code above, the value is assigned to the variable called ssmlnl_JC.
- 24.
Use path sampling to calculate marginal likelihoods from the output of the powerPosterior function:
ps_JC = pathSampler(file="output/powerp_JC.out", powerColumnName="power", likelihoodColumnName="likelihood")
Path sampling (also called thermodynamic integration) is an alternative approach to computing the marginal likelihood from a series of power posteriors (Lartillot and Philippe, 2006; Baele, Lemey, Bedford, Rambaut, Suchard, & Alekseyenko, 2012). Like the stepping-stone sampler above, the pathSampler function also reads in the power-posterior output file and can be assigned to a workspace variable.
- 25.
Assign the path-sampling estimate of the marginal likelihood to a workspace variable:
psmlnl_JC = ps_JC.marginal()
Similar to the stepping-stone approach, we can assign the marginal likelihood computed by path sampling to a workspace variable.
Estimate marginal likelihoods under the HKY model
- 26.
Use stepping-stone sampling to calculate marginal likelihoods:
ss_HKY = steppingStoneSampler(file="output/powerp_HKY.out", powerColumnName="power", likelihoodColumnName="likelihood")
ssmlnl_HKY = ss_HKY.marginal()
Like steps 22 and 23 above, performed for the JC model, a stepping-stone sampler and its marginal-likelihood estimate are created for the HKY model.
- 27.
Use path sampling to calculate marginal likelihoods:
ps_HKY = pathSampler(file="output/powerp_HKY.out", powerColumnName="power", likelihoodColumnName="likelihood")
psmlnl_HKY = ps_HKY.marginal()
Like steps 24 and 25 above, performed for the JC model, a path sampler and its marginal-likelihood estimate are created for the HKY model.
Compute Bayes factors
- 28.
Compute the ln-Bayes factor in favor of the JC model using stepping-stone sampling:
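As a sketch, using the stepping-stone estimates from steps 23 and 26 (entering the expression at the Rev prompt prints its value):

```
ssmlnl_JC - ssmlnl_HKY
```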
To compute the Bayes factor, simply calculate the difference between the two marginal likelihoods estimated under a given type of sampler. This procedure is described further below; the difference is equal to K, which is defined in Equation 6.16.2 (see Guidelines for Understanding Results). The commands above will print the value of K to the screen, which is the support in favor of the JC model relative to the HKY model (with marginal likelihoods estimated under stepping-stone sampling; Fan et al., 2011; Xie et al., 2011).
- 29.
Compute the ln-Bayes factor in favor of the JC model using path sampling:
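As a sketch, using the path-sampling estimates from steps 25 and 27:

```
psmlnl_JC - psmlnl_HKY
```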
The commands above will print the value of K to the screen, which is the support in favor of the JC model relative to the HKY model (with marginal likelihoods estimated under path sampling; Lartillot and Philippe, 2006; Baele et al., 2012).
- 30.
Refer to Guidelines for Understanding Results for information about how to interpret ln-Bayes factors.
Evaluate the GTR model
- 31.
Estimate the marginal likelihood under the GTR model and evaluate the support under the GTR relative to the JC and HKY models using Bayes factors.
The steps outlined in this protocol and the model specification in Basic Protocol 1 provide all the commands needed to estimate the marginal likelihood under the GTR model for this dataset. Pairwise comparisons using Bayes factors will then enable model selection among the JC, HKY, and GTR models.
Guidelines for Understanding Results
The goal of a Markov chain Monte Carlo analysis is to generate samples from a target distribution. In a Bayesian phylogenetic analysis, the target distribution is the joint posterior distribution, P(θ | D), where θ includes all the estimated model parameters (the tree topology, branch lengths, exchangeability rates, stationary frequencies, etc.) and D is the observed sequence data. Thus, we use MCMC simulation to approximate the posterior distribution and to find a range of parameters with high posterior probability density (i.e., the 95% credible interval). In Basic Protocol 1 and Basic Protocol 2, the posterior samples are separated into two files. The .trees file contains the sampled posterior distribution of phylogenies (topologies and branch lengths) stored in Newick format. The .log file contains a tab-delimited “trace” of each model parameter, with one parameter per column. In both cases, each row corresponds to the MCMC state when sampled by the corresponding RevBayes monitor.
Summarizing the Posterior Distribution of Phylogenies
The primary goal of many phylogenetic analyses is to produce a point estimate of the phylogeny, including its topology and branch lengths. We will compute the maximum a posteriori (MAP) phylogeny in RevBayes. First, we read in the posterior sample of non-clock phylogenies.
- treetrace = readTreeTrace("output/primates_cytb.trees", treetype="non-clock")
Next, we compute the MAP tree using the mapTree function.
- mapTree(treetrace,"output/primates_partition_MAP.tre")
The function first finds the topology with the highest posterior probability. Given that topology, mapTree then uses the mean of the posterior branch-length distributions to estimate the MAP tree's branch lengths. Finally, mapTree converts the MAP tree to a Newick string, annotates it with useful quantities, such as posterior probabilities of clade support, and then saves the Newick string to file. Figure 6.16.4 shows the MAP tree estimated in Basic Protocol 1, viewed in the tree-visualization program FigTree (http://tree.bio.ed.ac.uk/software/figtree).
This analysis of the cytochrome b sequence data of primates shows some interesting results. There has been considerable disagreement about the placement of tarsiers (genus Tarsius) in the primate phylogeny (Yoder, 2003; Chatterjee, Ho, Barnes, & Groves, 2009; Hartig, Churakov, Warren, Brosius, Makałowski, & Schmitz, 2013). In previous studies, most support is given to a Tarsius-Haplorhini sister relationship (Haplorhini includes New World monkeys, Old World monkeys, and anthropoid primates), although several molecular studies have found conflicting results. The sequence of our Tarsius representative is placed as a sister lineage to all Strepsirrhini (Strepsirrhini includes lemurs, lorises, and bushbabies). However, the support for this Tarsius-Strepsirrhini sister relationship is comparatively weak, with a posterior probability of 0.5438, although it is the most probable evolutionary relationship given our model and data. Importantly, we want to stress that this result is based on a very small dataset intended for this exercise. Nevertheless, it exemplifies the type of topological question one can readily answer with this analysis.
There are alternative ways to summarize the posterior distribution of phylogenies besides the MAP tree. For example, consensus tree methods combine the posterior distribution into a single point estimate (Holder, Sukumaran, & Lewis, 2008; Heled & Bouckaert, 2013). The output of RevBayes can be summarized with consensus tree methods implemented in other software, such as DendroPy (Sukumaran & Holder, 2010).
Posterior Estimates for Standard Parameters
The mnModel monitor in RevBayes will discover all non-constant nodes in the graphical model and save their values to a tab-delimited file, known as a trace file. Each row corresponds to an iteration at which the MCMC state was sampled, and each column corresponds to a particular variable in the MCMC state space, such as the shape parameter for among-site rate variation.
First, we will review the effective sample size (ESS) for parameter estimates. Parameters with large ESS values generally indicate that the MCMC gathered enough samples to accurately estimate the parameter's marginal posterior density. If the ESS is low for a given parameter (indicated in red in Tracer), the MCMC may need to be run for more generations, or the move responsible for sampling the parameter in question may need to be assigned a greater weight.
Second, we will compare the posterior distribution between two independent runs to assess convergence. If the posterior samples do not appear equivalent, then one (or both) MCMC analyses failed to produce valid samples from the same posterior distribution. Note that even if both posterior samples appear equivalent, it is still possible that neither MCMC reached convergence. As a rule of thumb, it is easier to show that an MCMC run has failed than it is to show it succeeded. The visual test described here is not rigorous, but adequate to rule out gross MCMC failures. More sophisticated tests are available in the R package (R Core Team, 2013) coda (Plummer, Best, Cowles, & Vines, 2006).
Two independent runs may be accomplished by adding the argument nruns=2 in step 16 of Basic Protocol 1.
- mymcmc = mcmc(mymodel, monitors, moves, nruns=2)
Third, assuming that the posterior sample appears valid, we will compare it to the prior distribution. The posterior distribution is proportional to the prior distribution times the likelihood function. This means that posterior distribution and prior distribution differ when the likelihood function is informative—i.e., when the parameter estimates are informed by data. We can easily sample from the prior distribution under any model by setting the underPrior=true flag in the mcmc.run method in RevBayes.
To record an estimate of the joint prior distribution under the model, first change the names of the output files in step 15 of Basic Protocol 1:
- monitors[++mni] = mnModel(filename="output/primates_cytb.prior.log", printgen=10, separator = TAB)
- monitors[++mni] = mnFile(filename="output/primates_cytb.prior.trees", printgen=10, separator = TAB, psi)
Then add the underPrior=true argument to the following commands in step 16 of Basic Protocol 1:
- mymcmc.burnin(generations=10000, tuningInterval=200, underPrior=true)
- mymcmc.run(generations=30000, underPrior=true)
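To quantify how far the posterior sample has moved away from the prior, the two trace files can be compared directly. The Python sketch below uses a two-sample Kolmogorov-Smirnov statistic; the prior-like and posterior-like samples are synthetic assumptions for illustration, not output from the protocol.

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    distance between the empirical CDFs of the two samples
    (0 = indistinguishable; values near 1 = barely overlapping)."""
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        # fraction of the sorted sample that is <= x
        return bisect.bisect_right(sample, x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

random.seed(3)
prior = [random.expovariate(1.0) for _ in range(2000)]      # diffuse, prior-like sample
posterior = [random.gauss(0.5, 0.05) for _ in range(2000)]  # concentrated, posterior-like sample
print(ks_statistic(prior, prior))            # 0.0: identical samples
print(ks_statistic(prior, posterior) > 0.3)  # True: the data pulled the posterior off the prior
```

A large statistic, as here, corresponds to the desirable case discussed next: the posterior has been pulled well away from the prior by the likelihood.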
Posterior estimates that differ markedly from the prior indicate that the data are informative. When the posterior and prior estimates appear similar, it often means the data are not informative enough to pull the posterior distribution away from the prior; if the prior and posterior were identical, one would expect to obtain the same parameter estimate even if no data were used. This may motivate collecting more data, re-designing the model to ensure all parameters are identifiable, or considering alternative priors.
Viewing the two independent posterior estimates and the prior estimate in Tracer, it is likely that the alpha shape parameter for among-site rate variation was adequately sampled, was estimated with valid samples from the posterior, and does not exhibit strong prior sensitivity (Fig. 6.16.5).
Interpreting Marginal Likelihoods and Bayes Factors
Alternatively, you can directly interpret the strength of evidence in favor of M0 in log space by comparing the value of K = ln BF(M0, M1) to the appropriate scale (Table 6.16.1, second column). In this case, we evaluate the natural logarithm of the Bayes factor in favor of model M0 against model M1, so that:

K = ln BF(M0, M1) = ln P(D | M0) - ln P(D | M1)
Table 6.16.1. The Scale for Interpreting Bayes Factors by Harold Jeffreys (1961)

| Strength of evidence | BF(M0, M1) | ln BF(M0, M1) | log10 BF(M0, M1) |
|---|---|---|---|
| Negative (supports M1) | <1 | <0 | <0 |
| Barely worth mentioning | 1 to 3.2 | 0 to 1.16 | 0 to 0.5 |
| Substantial | 3.2 to 10 | 1.16 to 2.3 | 0.5 to 1 |
| Strong | 10 to 100 | 2.3 to 4.6 | 1 to 2 |
| Decisive | >100 | >4.6 | >2 |
Thus, values of K around 0 indicate that there is no preference for either model. Variations on the Bayes factor and further background are provided in the Commentary section below.
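Jeffreys' scale is easy to apply programmatically. The Python sketch below maps a value of K onto the verbal categories of Table 6.16.1; the marginal log-likelihood values are hypothetical, not taken from the protocol.

```python
import math

def jeffreys_category(ln_bf):
    """Map K = ln BF(M0, M1) onto Jeffreys' (1961) verbal scale.
    Thresholds are ln 3.2, ln 10, and ln 100, matching Table 6.16.1."""
    if ln_bf < 0:
        return "negative (supports M1)"
    if ln_bf < math.log(3.2):
        return "barely worth mentioning"
    if ln_bf < math.log(10):
        return "substantial"
    if ln_bf < math.log(100):
        return "strong"
    return "decisive"

# Hypothetical marginal log-likelihoods for two competing models:
lnL_m0, lnL_m1 = -5740.2, -5745.9
K = lnL_m0 - lnL_m1  # K = ln BF(M0, M1) = 5.7
print(jeffreys_category(K))  # decisive
```

Working in log space this way avoids exponentiating large negative marginal log-likelihoods, which would underflow to zero in floating-point arithmetic.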
In Basic Protocol 3, you will find that the values of K computed in steps 28 and 29 indicate that the data support the HKY model of substitution. It is important to note, however, that these two models are only a small subset of the substitution models available for describing the evolution of nucleotide data. RevBayes provides a straightforward approach to comparing models (as outlined above), allowing users to evaluate a range of candidate models for their analysis.