Evolutionary Models Implemented in MrBayes 3

From MbWiki

Jump to: navigation, search

Contents

Evolutionary Models Implemented in MrBayes 3

MrBayes implements a wide variety of evolutionary models for nucleotide, amino acid, restriction site (binary), and standard discrete data. In addition, there are several different ways of modeling the process generating phylogenetic trees. An overview of all the models is given in diagrammatic form in the Appendix. Here, we provide a brief description of each model with some comments on their implementation in MrBayes and advice on how you can apply them successfully to your data.

Nucleotide Models

MrBayes implements a large number of DNA substitution models. These models are of three different structures. The "4by4" models are the usual simple models of nucleotide evolution. The "Doublet" model is intended for stem regions of ribosomal DNA, where nucleotides evolve in pairs. Finally, the "Codon" models group the nucleotides in triplets and model evolution based on these. The type of nucleotide model is set in MrBayes with lset nucmodel; for instance, if you want to use the doublet model, the command is lset nucmodel=doublet. The default setting is 4by4.

Simple Nucleotide Models

There has been more work based on the simple four by four nucleotide models than on any other type of evolutionary model for molecular data. MrBayes 3 implements three main types of models; you select among those by setting the number of substitution types using lset nst to 1, 2, or 6. The widely used General Time Reversible (GTR) model has six substitution types (lset nst=6), one for each pair of nucleotides. The instantaneous rate matrix for the GTR model is (note that we order the rows and columns alphabetically - A, C, G, T - unlike some other authors)

MATRIX

The GTR model (Tavaré, 1986) has four stationary state frequencies (<math>\pi_A</math>, <math>\pi_C</math>, <math>\pi_G</math>, <math>\pi_T</math>) and six rate parameters (<math>r_{AC}</math>, <math>r_{AG}</math>, <math>r_{AT}</math>, <math>r_{CG}</math>, <math>r_{CT}</math>, <math>r_{GT}</math>). In total, there are eight free parameters, since one of the stationary state frequencies and one of the substitution rates are determined by the others. By default, MrBayes uses a "flat" Dirichlet distribution (with all distribution parameters set to 1) as the prior for both the stationary state frequencies and the substitution rates. This is a reasonable uninformative prior that should work well for most analyses.

During the analysis, both the stationary state frequencies and substitution rates are estimated. If you want to fix the stationary state frequencies or substitution rates, you can do that by using the prset command. For instance, assume that we want to fix the stationary state frequencies to be equal, converting the GTR model to the so-called SYM model. This is achieved by prset statefreqpr=fixed(0.25,0.25,0.25, 0.25) or, more conveniently, prset statefreqpr=fixed(equal). The substitution rate matrix now becomes

MATRIX

and the stationary state frequencies are no longer estimated during the analysis. Similarly, it is possible to fix the substitution rates of the GTR model using prset revmatpr=fixed. Assume, for instance, that we want to fix the substitution rates to be (<math>r_{AC} = 0.11</math>, <math>r_{AG} = 0.22</math>, <math>r_AT = 0.12</math>, <math>r_{CG} = 0.14</math>, <math>r_{CT} = 0.35</math>, <math>r_{GT} = 0.06</math>). Then the correct statement would be prset revmatpr=fixed(0.11,0.22,0.12,0.14,0.35, 0.06). Note the order of the rates. The substitution rates can be given either as percentages of the rate sum, as here, or they can be scaled to the <math>r_{GT}</math> rate. The former representation is the Dirichlet parameterization used internally by MrBayes. By default, MrBayes reports substitution rates in the Dirichlet format but you can request conversion of sampled rates to the scaled format using the report command if you prefer this representation. One disadvantage with the scaled format is that the posterior distribution tends to be strongly skewed such that the arithmetic mean of the sampled values is considerably higher than the mode and the median. Therefore, consider using the median instead of the mean when reporting a posterior distribution sampled in the scaled format.

Before using the GTR model for some of your data, we recommend that you make sure there are at least a few possible substitutions of each type. For instance, if there is not a single GT substitution in your data, it will be difficult to estimate the GT substitution rate. In such cases, you should consider either the HKY model (nst=2) or the F81 model (nst=1) instead of the GTR model. The HKY model (Hasegawa, Kishino and Yano, 1985) has different rates for transitions (<math>r_{ti}</math>) and transversions (<math>r_{tv}</math>):

MATRIX

The HKY model is often parameterized in terms of the ratio between the transition and transversion rates, <math>\kappa = r_{ti}/r_{tv}</math>, and this is the default format used by MrBayes when reporting samples from the posterior distribution. Internally, however, MrBayes uses the Dirichlet parameterization in which the transition and transversion rates are expressed as percentages of the (unscaled) rate sum. If you prefer, you can have MrBayes report the sampled values using the Dirichlet format instead of the ratio format by using the command report tratio=dirichlet. The caveats described above for the GT-scaled substitution rates also apply to the transition / transversion rate ratio. In the .p files, the ratio will be referred to as kappa and the transition and transversion rate proportions will be referred to as ti and tv. The setting of the report tratio option will determine whether you will see a single kappa column or the two ti and tv columns.

As with the GTR model, you can fix both the stationary state frequencies and the transition / transversion rate ratio of the HKY model. If you fix the stationary state frequencies to be equal, you will get the K2P model (Kimura, 1980). Finally, MrBayes implements the F81 model (Felsenstein, 1981), which assumes that all substitution rates are equal but stationary state frequencies are not, that is

MATRIX

If the stationary state frequencies are fixed to be equal using prset statefreqpr=fixed(equal), you will get the simplest of all nucleotide substitution models, the JC model (Jukes and Cantor, 1969).

A large number of other subsets of the GTR model are often used in Maximum Likelihood inference. Why do we not allow more substitution model types in MrBayes? One of the most important advantages of the Bayesian approach is that it allows you to integrate out uncertainty concerning the model parameters you have little information about. Thus, Bayesian inference is relatively robust to slight over-parameterization of your model. In addition, the MCMC sampling procedure is typically efficient in dealing with complex multi-parametric models. For these reasons, it is less important in the Bayesian context to find the simplest possible model that can reasonably represent your data. If you use a model-testing procedure and it suggests a four by four nucleotide model not implemented in MrBayes, then you should obtain good results using the next more complex model available in the program.

The Doublet Model

The doublet model of MrBayes is intended for stem regions of ribosomal sequences, where nucleotides pair with each other to form doublets. The nucleotide pairing results in strong correlation of substitutions across sites – when there is a substitution at one site it is typically accompanied by a compensatory substitution at the paired site. If the correlation between paired sites is not accounted for, parametric statistical methods will overestimate the confidence we should have in the best tree(s). Incidentally, the same is true for parsimony and the non-parametric bootstrap.

There are various ways to model the evolution of nucleotide doublets. One method is to focus on the common doublets, A-T and C-G in particular. MrBayes uses a more complex model, originally formulated by Schoniger and von Haeseler (1994), where all doublets are taken into account. The central idea in this model is that one common doublet is converted into another through a two-step process. In the first step, one of the nucleotides is substituted with another according to a standard four by four model of nucleotide change. In the second step, the matching nucleotide is changed according to the same standard four by four model. Thus, in this model there is no momentary change from one doublet to another doublet; this always occurs through an intermediate, rare doublet.

Assume that we are using a GTR model for the single nucleotide substitutions, that i and j are two different doublets, that <math>d_{ij}</math> is the minimum number of nucleotides that must be changed when going between i and j, and that mn is the pair of nucleotides that change when going between i and j when <math>d_{ij} = 1</math>. Then the elements <math>q_{ij}</math> of the instantaneous rate matrix Q of the doublet model can be expressed as follows

MATRIX

for the case when i differs from j; the diagonal elements (i = j) are determined, as usual, to balance the rows in the instantaneous rate matrix to sum to zero. This gives the instantaneous rate matrix (only 7 rows and columns out of 16 shown):

MATRIX

Instead of using the GTR model, we could of course have used the HKY or F81 model, resulting in obvious modifications of the <math>r_{mn}</math> values. Use lset nst to change among those options; for instance, use lset nst=6 to choose the GTR model. The default model is the F81 model (lset nst=1). Before the doublet model can be used, it is necessary to specify all the nucleotide pairs in the alignment. This is done using the pairs command, most conveniently in a MrBayes block in a data file. For instance, pairs 1:10, 2:9; would pair nucleotide 1 with 10 and 2 with 9. See the kim.nex data file for an example of an analysis using the doublet model.

Codon Models

The codon models implemented in MrBayes are based on the first formulations of such models (Goldman and Yang, 1994; Muse and Gaut, 1994). The approach is similar to that used in the doublet model. A codon can change to another only through steps of one nucleotide change at a time. These steps are modeled using a standard four by four nucleotide model. There is one additional complication though: some nucleotide changes are synonymous while others lead to changes in amino acids and thus may be subject to selection (a factor modifying the substitution rate). Assume that i and j are different codons, that a(i) is the amino acid coded for by codon i, that dij is the minimum number of nucleotide changes involved in changing between them, and that w is the ratio of the non-synonymous to the synonymous substitution rate. The off-diagonal elements of the instantaneous rate matrix can now be defined as

with the diagonal elements being defined to balance the rows of the instantaneous rate matrix, as usual. The single-nucleotide substitution rates can be modeled using the GTR, the F81, or the JC model, as before. Use lset nst to change among those options; for instance, use lset nst=6 to choose the GTR model. The default model is the F81 model (lset nst=1).

Invoking the codon model is easy; just make sure that the aligned DNA or RNA sequences start with a nucleotide in codon position 1 and that they end with a nucleotide in codon position 3. Also, make sure that the sequences do not contain any stop codons. To figure out whether a codon is a stop codon and whether a particular single-nucleotide change involves an amino acid change, MrBayes uses one of several genetic codes. By default, MrBayes uses the universal code but you can select other codes by using the lset code command. Note that the codon models are computationally demanding. Whereas the computations for the simple four by four models need to deal with only 16 Q-matrix and transition probability elements (4x4), the codon model computations need to process more than 3,600 Q-matrix and transition probability elements resulting in these runs being roughly 200 times slower. The codon models also require much more memory than the four by four models, about 16 times as much.

The simplest codon model, described above, assumes that all amino acid sites are subject to the same level of selection (the same w factor). However, MrBayes also implements models accommodating variation in selection across sites. This allows you to detect positively selected amino-acid sites using a full hierarchical Bayes analysis (that is, an analysis that does not fix tree topology, branch lengths or any substitution model parameters but instead integrates out uncertainty in all these factors).

The w variation models work much like the model accommodating rate variation across sites according to a gamma distribution. The likelihood of each site is calculated under several different w values and then the values are summed to give the site likelihood. A difference is that the stationary frequency of each omega category is estimated, instead of being fixed as in the case of the gamma distribution. There are two variants implemented in MrBayes, and they are invoked using the lset omegavar option. In the Ny98 model (Nielsen and Yang, 1998), there are three classes with potentially different w values: 0 < w1 < 1, w2 = 1, and w3 > 1. The M3 model also has three classes of w values but these values are less constrained in that they only have to be ordered w1 < w2 < w3. If, for instance, you would like to invoke the M3 model, use the command lset omegavar=M3.

When you use a model with variation in selection pressure across sites, you probably want to infer the positively selected sites. If you select report possel=yes before you start your analysis, MrBayes will calculate the probability of each site being in a positively selected omega class. For the M3 model, for instance, the likelihood of the site is calculated under each of the three categories, taking the category frequencies into account, and then the likelihoods are summed to yield the total likelihood of the site. Finally, the proportion of this sum originating from categories that are positively selected (those that have an w value larger than 1); this is the posterior probability of the site being positively selected.

Once the probabilities of each site being positively selected are printed to file, they can be summarized using the standard sump command. When interpreting the output from the Ny98 model, it is helpful to know that pi(-), pi(N) and pi(+) are the frequencies of the negatively selected, neutral and positively selected categories, respectively, and omega(-), omega(N) and omega(+) are the corresponding w values. The M3 model parameters are instead labeled pi(1), pi(2) and pi(3) for the category frequencies and omega(1), omega(2) and omega(3) for the w values. The probability of a codon being positively selected is labeled by the site numbers in the original alignment. Thus pr+(16,17,18) is the probability of the codon corresponding to the original nucleotide alignment sites 16, 17, and 18 being in a positively selected omega category.

Amino-acid Models

MrBayes implements a large number of amino-acid models. They fall in two distinct categories: the fixed-rate models and the variable-rate models. The former have both the stationary state frequencies and the substitution rates fixed, whereas one or both of these are estimated in the latter.

Fixed Rate Models

The Poisson model (Bishop and Friday, 1987) is the simplest of the fixed rate models. It assumes equal stationary state frequencies and equal substitution rates; thus, it is analogous to the JC model for nucleotide characters. The rest of the fixed-rate models have unequal but fixed stationary state frequencies and substitution rates reflecting estimates of protein evolution based on some large training set of proteins. These models include the Dayhoff model (Dayhoff, Schwartz and Orcutt, 1978), the Mtrev model (Adachi and Hasegawa, 1996), the Mtmam model (Cao et al., 1998; Yang, Nielsen, and Hasegawa, 1998), the WAG model (Wheland and Goldman, 2001), the Rtrev model (Dimmic et al., 2002), the Cprev model (Adachi et al., 2000), the Vt model (Muller and Vingron, 2000) and the Blosum62 model (Henikoff and Henikoff, 1992). Each model is appropriate for a particular type of proteins. For instance, if you are analyzing mammal mitochondrial proteins, you might want to use the Mtmam model. Invoke that model by typing prset aamodelpr=fixed(mtmam).

Estimating the Fixed Rate Model

MrBayes allows a convenient way of estimating the fixed-rate model for your amino acid data instead of specifying it prior to the analysis. If you choose this option, MrBayes will allow the MCMC sampler to explore all of the fixed-rate models listed above, including the Poisson model, by regularly proposing new models. When the MCMC procedure has converged, each model will contribute to your results in proportion to its posterior probability. For instance, if you are analyzing mammal mitochondrial proteins, it is likely that the Mtmam model will contribute most to the posterior distribution but it is possible that some other models, for instance the Mtrev model, will contribute significantly too. A nice feature of the MCMC model estimation is that the extra computational cost is negligible.

To allow so-called model jumping between fixed-rate amino acid models, simply set the prior for the amino acid model to mixed, prset aamodelpr=mixed, prior to analysis. During the run, MrBayes will print the index of the current model to the .p file(s) in the aamodel column. The indices of the models are as follows: 0 = Poisson, 1 = Jones, 2 = Dayhoff, 3 = Mtrev, 4 = Mtmam, 5 = Wag, 6 = Rtrev, 7 = Cprev, 8 = Vt, 9 = Blosum. When you use the sump command, you will get a table with the names of the amino acid models and their posterior probabilities.

Variable Rate Models

There are two variable-rate models implemented in MrBayes for amino acid data. The Equalin model is a “glorified” F81 model in that it allows the stationary state frequencies of all amino acids to be different but assumes the same substitution rate. Thus, the instantaneous rate matrix for this model is

and it has 19 free parameters (20 stationary state frequencies minus one because the frequencies sum to 1) that will be estimated from data. The other variable-rate model is the “glorified” GTR model, which allows all stationary state frequencies and substitution rates to vary. Thus, the instantaneous rate matrix for this model is

and the model has 19 free stationary state frequency parameters and 189 free substitution rate parameters. The Bayesian MCMC approach is good at handling uncertainty in multi-parameter models, so the GTR model may be used successfully with moderate-size data sets, but the model is so parameter-rich that you need a fairly sizable data set to be able to estimate all parameters with reasonable precision.

The GTR model can be used to express a user-derived fixed rate model other than those already implemented in MrBayes. Simply use the prset command to fix the stationary state frequencies and substitution rates of the GTR model to the desired values. You need to set two options, prset aarevmatpr=fixed(<190 comma-separated values>) and prset statefreqpr=fixed(<20 comma-separated values>). Once the values are fixed prior to analysis, the MCMC procedure will not change them and they will remain the same throughout the analysis.

Restriction Site (Binary) Model

MrBayes implements a simple F81-like model for restriction sites and other binary data. The instantaneous rate matrix for this model is simply

Any asymmetry in the rate of 0 to 1 and 1 to 0 transitions is expressed in terms of the stationary state frequencies. Thus, if the stationary frequencies are p0 = 0.25 and p1 = 0.75, then the rate of 0 to 1 transitions is 3 times as high as the rate of transitions in the other direction (p1 / p0 = 3).

A problem with some binary data sets, notably restriction sites, is that there is an ascertainment (coding) bias such that certain characters will always be missing from the observed data. It is impossible, for instance, to detect restriction sites that are absent in all of the studied taxa. MrBayes corrects for this bias by calculating likelihoods conditional on the unobservable characters being absent (Felsenstein, 1992). The ascertainment (coding) bias is selected using lset coding. There are five options: (1) there is no bias, all types of characters could, in principle, be observed (lset coding=all); (2) characters that are absent (state 0) in all taxa cannot be observed (lset coding= noabsencesites); (3) characters that are present (state 1) in all taxa cannot be observed (lset coding=nopresencesites); (4) characters that are constant (either state 0 or 1) in all taxa cannot be observed (lset coding=variable); and (5) only characters that are parsimony-informative have been scored (lset coding=informative). For restriction sites it is typically true that all-absence sites cannot be observed, so the correct coding bias option is noabsencesites.

The binary model is useful for a number of character types other than restriction sites. For instance, the model can be used for gap characters. The presence and absence of gaps must be coded consistently for all characters; let us assume here that absence of a gap is coded as 0 and presence as 1. Since the detection of gaps is typically contingent on observing some sequence length variation, neither all-absence nor all-presence characters can be observed. Thus, the correct ascertainment bias for gap characters is variable. The parameters p0 and p1 would represent the rate at which insertions and deletions occur, respectively (assuming that state 0 denotes absence of a gap).

The binary model can also be used for ecological, morphological, or other binary characters of arbitrary origin. However, if the binary model is applied to more than one character, then there is an implicit assumption that the state labels are not arbitrary for those characters. That is, the 0 state in one character must somehow be comparable to the 0 state in the others. For instance, 0 could mean absence (or presence) of a particular type of feature, such as a wing vein, a restriction site, or a gap in a DNA sequence. It is not appropriate to apply the default binary model to a set of characters where the state labels are arbitrary, as is true of most morphological characters. Thus, we can possibly estimate the rate of loss versus gain of wing veins over a set of consistently coded wing venation characters, but we cannot compare the rate of loss of antennal articles to the rate at which a yellow patch evolves into a green patch. If state labels are truly arbitrary, then the stationary state frequencies of the binary model must be fixed to be equal, such that the estimation of model parameters becomes independent of the labeling of character states. An alternative is to consider the standard model, which provides more sophisticated ways of dealing with arbitrary state labels.

When is the correction for ascertainment bias important? This is strongly dependent on the size of the tree (the sum of the branch lengths on the tree). The larger the tree, the less important the correction for ascertainment bias becomes. In our experience, when there are more than 20-30 taxa, even the most severe bias (only informative characters observed) is associated with an insignificant correction of the likelihood values.

Standard Discrete (Morphology) Model

The model used by MrBayes for “standard” discrete data is based on the ideas originally presented by Lewis (2001). Essentially, the model is analogous to a JC model except that it has a variable number of states, from 2 to 10. For instance, a three-state standard character would be evolving according to the instantaneous rate matrix

Because all rates are the same, we can maintain the essential property of standard characters, namely that state labels are arbitrary. Thus, the standard model assures that you will get the same results regardless of the way in which you label the states.

In morphology-based parsimony analyses, one sometimes distinguishes between ordered and unordered characters. In ordered characters, evolution between some states is assumed to go through intermediate states. MrBayes implements a stochastic model for such characters. For a three-state character assumed to be ordered (by convention in the sequence 0-1-2), the instantaneous rate matrix is

Note that the instantaneous rate of going between the two end states is 0, that is, a transition from 0 to 2 or from 2 to 0 has to go through state 1. By default, MrBayes treats all standard characters as unordered. To change this, use the ctype command. For instance, if you want to treat characters 3 and 4 as ordered you need to include the statement ctype ordered: 3 4; in your MrBayes block (or enter it using the command line, if you prefer).

The number of states of each standard character is determined by MrBayes simply by looking at the state codes in your matrix. Thus, a three-state model will be used for a three-state character and a six-state model for a six-state character. MrBayes does not check that all state codess are used, it simply finds the largest state code in the matrix for each character. Thus, make sure that you use the state codes 0, 1, and 2 for a three-state character and state codes 0, 1, 2, 3, 4, and 5 for a six-state character.

Because state labels are arbitrary in the standard model, we cannot estimate unequal stationary state frequencies or substitution rates (recall that the stationary state frequencies are an important factor in determining the latter). However, it is still possible to allow the state frequencies (rates) to vary over sites according to some appropriate distribution. MrBayes uses a symmetric Dirichlet distribution for this purpose. For binary data, the analogy of the Dirichlet distribution is called the beta distribution; MrBayes uses Dirichlet and beta interchangeably for the distribution depending on context. The approach is similar to the one used to allow rate variation across sites according to a gamma distribution: we calculate the likelihood of a site assuming different discrete categories of asymmetry and then we sum the values to obtain the site likelihood.

The symmetric Dirichlet distribution has one parameter that determines its shape, just like the alpha parameter determines the shape of the gamma distribution. The larger the parameter of the symmetric Dirichlet, the less transition rate (stationary frequency) asymmetry there is across sites. By default, the parameter is fixed to infinity (prset symdirihyperpr=fixed(infinity)); this corresponds to the standard assumption of no transition rate asymmetry across sites: the rate of going from 0 to 1 is equal to the rate of going from 1 to 0 for all characters. The prior is called a hyperprior because it concerns a distribution that controls the distributions of other model parameters (stationary state frequencies in this case). If you want to allow transition rate (stationary frequency) asymmetry in standard data, then simply select another (hyper)prior. For instance, you can fix the parameter to 1.0, which would result in a uniform prior on the proportions of the state frequencies.

In practice, MrBayes uses a discrete approximation of the Dirichlet distribution for binary characters; five categories are used by default (change this with lset nbetacat). For instance, assume that we fix the hyperprior to 1.0 and then evaluate the likelihood of one binary character using five discrete beta categories. MrBayes would then calculate the likelihood of the character assuming that the stationary state frequencies of the two states were 0.1:0.9, 0.3:0.7, 0.5:0.5, 0.7:0.3 and 0.1:0.9. The five category likelihoods would then be multiplied by 0.20 (there is a probability of 0.20 of being in each of the categories) and then summed up to give the total likelihood of the character. For multi-state characters, MrBayes does not use the discrete approximation; instead, it uses the MCMC procedure to explore different stationary state frequency proportions.

Parsimony Model

MrBayes implements an incredibly parameter-rich model, first described by Tuffley and Steel (1997). It orders trees in terms of their maximum likelihood in the same way as the parsimony method would order them in terms of their parsimony score; hence we call it the parsimony model. The model is also referred to as the No-Common-Mechanism model because it treats each branch length for each character as a separate, completely independent parameter. In principle, a Bayesian MCMC analysis using the parsimony model should integrate out the branch lengths but MrBayes 3 uses a simpler approach, in which the branch lengths are fixed to their maximum likelihood values (infinity if there is a change on the branch and zero otherwise). This type of approach, where some parameters are fixed prior to the Bayesian analysis according to some non-Bayesian estimate, is typically referred to as an empirical Bayes method. Future versions of MrBayes may implement the true (hierarchical) Bayesian approach to the parsimony model but we expect the results to be very similar under both approaches.

The parsimony model is much less parsimonious with respect to parameters than any other model implemented in MrBayes. Consider, for instance, an analysis of 1,000 characters and 100 taxa. The parsimony model would have about 200,000 free parameters (200 branches times 1,000 characters). A more typical GTR + G + I model would have only little more than 200 parameters, about 1,000 times fewer parameters. In this sense, the standard stochastic models are much more parsimonious than the parsimony model. Several problems are associated with the excessive number of parameters. Statistical inconcistency is perhaps the best known of these but, more fundamentally, a model like the parsimony model does not offer much in terms of generalities that can be inferred about the evolutionary process.

The Goldman (1993) model is another example of a parameter-rich stochastic model that orders trees in the same way as the parsimony method. In this model, the branch lengths are the same but the ancestral states are estimated for all characters and all nodes in the tree. For an analysis of 100 taxa and 1,000 characters, this results in approximately 100,000 free parameters. The Goldman model is actually very similar to the No-Common-Mechanism model; it makes little difference if the character-specific and tree-section-specific parameters are introduced at the nodes or at the branches. The Goldman model is not implemented in MrBayes.

We would like to emphasize that we do not recommend the use of the parsimony model. We have included it in MrBayes only to allow users to explore its properties and contrast it with the other models implemented in the program. The parsimony model is not the default model used for standard (morphological) data in MrBayes. The default standard data model is described above and is similar to the models used for nucleotide and amino acid data. By default, MrBayes does not use the parsimony model at all; you have to invoke it using lset parsmodel=yes.

Rate Variation Across Sites

By default, MrBayes assumes that all characters evolve at the same rate (measured in expected changes per site over the tree). There are four ways in which you can allow rate variation across sites. The simplest method is to assume that rates vary over sites according to a gamma distribution (Yang, 1993). The gamma model can be combined with spatial autocorrelation between the rates at adjacent sites, the autocorrelated gamma model. A completely different approach to rate variation across sites is to allow a proportion of sites to be invariable. This model can be combined with the gamma model but not with the autocorrelated gamma model. Finally, it is possible to divide characters into groups evolving at different rates, the partitioned or site specific rate model.

Gamma-distributed Rate Model

The commonly used model of gamma-shaped rate variation across sites is invoked using lset rates=gamma. The shape of the gamma distribution is determined by the so-called a (alpha) parameter. When this parameter is small (below 1), the distribution takes on an L-shaped form with a few sites evolving rapidly while most sites are conserved. Conversely, when a is above 1 the distribution becomes similar to a normal distribution with less and less variation in rates across sites as a becomes larger.

In practice, the gamma distribution is approximated using a small number of discrete rate categories (Yang, 1994). By default, four rate categories are used; you can change this setting by using lset ngammacat. For instance, if you want to use eight discrete rate categories the appropriate command is lset ngammacat=8. The computational complexity is proportional to the number of categories used. An analysis with four discrete gamma categories is four times as slow as an analysis with no rate variation across sites and twice as fast as one with eight categories. (Note that some documentation currently says "ncat" instead of "ngammacat"; ngammacat is what will work.) In MrBayes 3.2, the usegibbs option allows you to avoid most of the increase in computational cost associated with using a discrete approximation of gamma-distributed rates. See the LSet UseGibbs Option description.

The shape parameter (a) of the gamma distribution is similar to the shape parameter of the Dirichlet distribution of stationary state frequencies used for standard data (see above) in that it controls the distribution of another model parameter (the site rates). Therefore, the prior probability distribution used for the shape parameter can be referred to as a hyperprior. The default prior used in MrBayes is a uniform distribution on the interval (0.05,50). The sampled values of the shape parameter are found under the column heading alpha in the .p file(s).

Autocorrelated Gamma Model

In this model, rates vary across sites according to an autocorrelated gamma model where the rate at each site depends to some extent on the rates at adjacent sites (Yang, 1995). The spatial autocorrelation is measured by the r (rho) parameter, which ranges from -1 (negative autocorrelation, that is, adjacent sites tend to have wildly different rates) to 1 (adjacent sites have very similar rates). The default prior probability for rho is a uniform distribution covering the entire interval (-1,1).

In the worst case, a small symmetric tree, the extra computational complexity incurred by invoking the auto correlated gamma model instead of the gamma model is comparable to a doubling of the number of taxa in the analysis. In more typical cases, moderate to large data sets, the additional computational cost is negligible and equivalent to adding a single taxon. As with the gamma model, the autocorrelated gamma distribution is approximated with a number of discrete rate categories determined by lset ngammacat.

As described by Yang (1995), protein-coding sequences tend to have a three-position offset in their autocorrelation. That is, the first codon position sites tend to have rates that are correlated with the adjacent first-position sites, second-position site rates are correlated with adjacent second-position rates, etc. You can take this effect into account by partitioning your sites into the three codon positions and then applying a separate autocorrelated gamma model to each of the categories.

You might want to invoke an autocorrelated gamma model with the same correlation coefficient for a dataset consisting of several concatenated genes. If so, it is necessary to inform MrBayes about the break points between different genes. Otherwise, the rates at the first site of each gene except the first one will be erroneously compared to the rates at the last site in the preceding gene. The command databreaks is provided for this purpose. For instance, if there are only two genes in your data set, the first with 960 sites, you would specify the break between them with the statement databreaks 960. Note that you specify the break by giving the last sequence site before the break. The databreaks command is only needed when you invoke a single autocorrelated gamma model for a multigene dataset. The databreaks command cannot be used to partition a data set.

Proportion of Invariable Sites

A completely different approach to rate variation is to allow a proportion of sites to be invariable. This model is invoked using lset rates=propinv. The proportion of invariable sites is referred to as pinvar; it can vary from 0 (no invariable sites) to 1 (all sites are invariable). The default prior is a uniform distribution on the interval (0,1); change it using prset pinvarpr.

The proportion of invariable sites model can be combined with the gamma model using lset rates=invgamma. Although this model is slightly better than the simple gamma model for many data sets, it sometimes results in a bimodal or ridge-like posterior probability distribution. In particular, it is not uncommon to see two peaks in the posterior, one with a low proportion of invariable sites and significant rate variation in the gamma distribution (low alpha value) and the other with a high proportion of invariable sites and moderate amounts of rate variation in the gamma distribution (moderately high alpha value). If you have a posterior of this kind, you should not be surprised if Metropolis-coupling results in rapid (instantaneous) shifts from one mode to the other during the stationary phase of the analysis. The reason for this is that different chains are likely to explore different peaks in the posterior, and successful swapping involving the cold chain is likely to result in mode-jumping. Also, you should consider presenting the entire distribution of the sampled alpha and pinvar parameters since simple point estimates of each parameter would be misleading.

Partitioned (Site Specific) Rate Model

For protein-coding nucleotide sequences, a site-specific rate model is often used, allowing each codon position (first, second and third codon position sites) to have its own rate. This results in a model with three rates, two of which are free to vary (since the average rate is 1.0 by definition). More generally, we might have different character divisions (separate genes, morphology, etc) which potentially evolve at very different rates.

In MrBayes 3, we provide a general mechanism for setting up these models based on partitioning the data set and then unlinking parameters across the partitions. Assume for instance that we want to set up a site specific rate model for a data set with one sequence. We first set up the codon site partitioning scheme using the following lines in a MrBayes block:

  charset pos1 = 1-.\3;
  charset pos2 = 2-.\3;
  charset pos3 = 3-.\3;
  partition by_codon = 3: pos1, pos2, pos3;
  set partition = by_codon;

The character sets are first defined using the dot sign (.) to mark the last character in the data set and the \3 sequence to include every third character in the specified range. Then a partitioning scheme called by_codon is defined using the previously named character sets. Finally, the partitioning scheme called by_codon is invoked using the set command.

When we process these commands in MrBayes using the execute command, the characters are divided into three sets corresponding to the codon positions. By default, however, all model parameters including the rate will be shared across partitions. To allow the rates to differ across partitions, we need to change the prior for rates using prset. Specifically, prset ratepr=variable invokes partition-specific rates. The partition-specific rate parameter is referred to as ratemult, and the individual rates are labeled m{1}, m{2}, etc for the rate (multiplier) of character division 1, 2, etc. See below for more information on how to set up partitioned models.

Inferring Site Rates

When you are allowing rate variation across sites, you may be interested in inferring the rates at each individual site. By default, the site rates are not sampled during a MCMC run. You need to request the sampling of these values using report siterates=yes. The rates will be referred to as r(<site number>). For instance, r(45) is the inferred rate at site (character) 45 of your data set.

Rate Variation Across the Tree: The Covarion Model

For both nucleotide sequence and amino-acid data, MrBayes allows rates to change across the tree under a covarion-like model (Tuffley and Steel, 1998; Huelsenbeck, 2002; see also Galtier, 2001). Specifically, the covarion-like model assumes that a site is either “on” or “off”. When it is on, it evolves under a standard four-by-four nucleotide or 20 by 20 amino acid model but when it is off, it does not change at all. The switching between on and off states is controlled by two rate parameters, s01 (from off to on) and s10 (from on to off). The instantaneous rate matrix of the nucleotide variant (also referred to as the covariotide model), assuming a GTR model for the on state, is

,

where k is a scaling constant determined by the proportion of time the sites spend in the on state. The matrix can be simplified into

,

where each R element is a four by four matrix: R1 contains the rates in the off state (all rates are 0), R2 and R3 describe the switching process (the diagonal elements are either s01 or s10), and R4 is the chosen model for the evolution in the on state.

The covarion-like model can be described as a general case of the proportion of invariable sites model (Huelsenbeck, 2002). As the switching rates go to zero, the proportion of these rates represented by the switch to the off state (s10) becomes identical to the proportion of invariable sites. When the switching rates are zero, there is no exchange between the on and off states and the characters in the off state remain off throughout the tree; in other words, they are invariable sites.

Note that the covarion-like model implemented in MrBayes differs from the original covarion model in that sites switch completely independently of each other between the on and off states. To invoke the covarion-like model, simply use lset covarion=yes and then choose the desired nucleotide or amino acid model using the other lset and the prset options. The covarion-like model can be combined with the gamma model of rate variation across sites.

Topology and Branch Length Models

The topology and branch length models in MrBayes are set using the prset topologypr options, which deal with the tree topology prior, and the prset brlenspr options, which deal with the branch lengths.

Unconstrained and Constrained Topology

There are two choices for the prior probability distribution on topology: uniform or constrained. By default, topologies are not constrained in the prior (prset topologypr=uniform), resulting in equal prior probability being associated with all possible labeled trees (unless a different topology prior is induced by the branch length model, see below). There are two instances in which you might want to constrain the topology: (1) when you want to contrast a hypothesis of monophyly for a group with the more general hypothesis with no topological constraints; and (2) when you want to infer ancestral states for a particular node in the tree. In both cases, you specify the constraint(s) first by listing the taxa that should form a monophyletic group. For instance, if you wanted to constrain taxa 4, 5 and 6 to be monophyletic, you would use

 constraint my_constraint -1 = 4 5 6;

This defines a constraint called “my_constraint” forcing taxa 4, 5 and 6 to form a monophyletic group in all trees that are sampled from the chain. In future versions of MrBayes, the value following the name of the constraint (-1 here), will give the relative probability of trees having the constrained partition. A negative number will force the constraint to always be present in the sampled trees; a positive number will specify how many times more likely the trees with the constraint are compared to the trees not having it. In version 3.1, however, MrBayes ignores this value and always treats the constraint as absolute.

When you define constraints, make sure that you have the outgroup selected correctly. By default, MrBayes uses the first taxon in the data matrix as the outgroup. You can change this by using the outgroup command. For instance, if you want taxon number 7 called “My_taxon” to be the outgroup, either use outgroup 7 or outgroup My_taxon. MrBayes 3.1 only allows a single taxon as the outgroup. Before the constraints take effect, you have to invoke them by using prset topologypr=constraints(<comma-separated list of constraints>). For instance, to enforce the constraint my_constraint defined above, use prset topologypr=constraints(my_constraint).

Non-clock (Standard) Trees

If you do not want to enforce a molecular clock, you choose an unconstrained branch length prior. Actually, you do not have to do anything because unconstrained branch lengths are the default. You can associate unconstrained branch lengths with either a uniform prior from 0 to some arbitrary value or an exponential prior. The default is an exponential distribution with parameter 10 (Exponential(10)); it has an expectation of 0.1 (= 1/10) but (in principle) it allows branch lengths to vary from 0 to infinity. The exponential distribution apparently puts a lot more probability on short branches than on long branches. However, because transition (substitution) probabilities change rapidly at small branch lengths but only very slowly at long branch lengths, the exponential prior is actually closer to an uninformative prior than the uniform distribution. We advise against using a uniform prior on branch lengths because of the large prior probability it puts on long branches and their close-to-random substitution probabilities.

To change the prior on unconstrained branch lengths you use prset brlenspr. For instance, assume you wanted to use an exponential prior with parameter 1 instead of the default prior. This prior is set by typing prset brlenspr=unconstrained: exponential(1).

Strict Clock Trees

MrBayes implements three strict clock models: the simple (uniform) model, the birth-death model, and the coalescence model. The birth-death model and coalescence models both have additional parameters describing the tree-generating process, whereas the simple model does not.

In the birth-death model (see Yang and Rannala, 1997, for a Bayesian implementation), trees are generated according to a birth-death model with a speciation and an extinction rate. The model, as implemented in MrBayes, can also be associated with a sampling probability of terminal lineages. The priors for these three parameters are set using the speciationpr, extinctionpr, and sampleprob parameters of the prset command.

In the coalescence model, the tree generating process is looked at from the opposite perspective, backward in time. Instead of lineages branching, this model sees them as coalescing into fewer and fewer ancestral lineages. This process occurs at a rate determined by the q (theta) parameter; it is also affected by the ploidy of the data (haploid or diploid). The prior for the theta parameter is set using prset thetapr. The ploidy level is set using lset ploidy.

The simple clock model is invoked by prset brlenspr=clock:uniform. There is only one additional parameter in the simple clock model, namely the total tree height. The prior for this parameter is set by prset treeheightpr. The default prior is an exponential distribution with parameter 1.0 (Exponential(1.0)).

Relaxed Clock Trees

Relaxed clock models and functions for dating are not implemented in MrBayes 3.1. According to current plans, they will be available in version 3.2.

Partitioned Models

MrBayes provides great flexibility in setting up partitioned models. By default, the characters are divided into partitions based on the data type. If there is only one data type in the matrix, then all characters will be in a single partition. The default partitioning scheme is called default. For information on how to set up a file with mixed data types, see the example file cynmix.nex and the tutorial in section 3 of this manual.

You can easily set up partitioning schemes that divide the characters up further than the default partition does using the partition command. The most convenient way of partitioning data is to define character sets first using the charset command. For instance, assume that you have concatenated nucleotide sequences from three genes in your data set of length 1962, 1050, and 2082 sites, respectively. Then you create character sets for those three genes using

 charset gene1 = 1-1962;
 charset gene2 = 1963-3012;
 charset gene3 = 3013-.;

in a MrBayes block. Note the use of the dot (.) as a synonym of the last site. You can also use the “backslash n” sequence to include every nth character in the preceding range of characters (see the description of the site specific rate model above). Once the character sets are defined, the partitioning scheme based on the genes is defined with the partition command and selected using the set command:

partition by_gene = 3: gene1, gene2, gene3;
set partition=by_gene;

Here, by_gene is the name we chose for the partitioning scheme. The name is followed by an equal sign, the number of partitions, and then a comma-separated list of characters to include in each partition. Note that MrBayes requires the partitioning scheme to include all characters. Say, for instance, that you wanted to run an analysis with only gene 1 and gene 2. Then define a two-partition scheme and exclude the characters represented by gene 3:

partition gene1&2 = 2: gene1, gene2 gene3;
exclude gene3;
set partition=gene1&2;

If the only purpose of the partition gene1&2 is to allow exclusion of gene 3, then gene 3 can of course be included in either of the two partitions before being excluded.

Once the partitions have been correctly set up, MrBayes allows you to set models for individual partitions using the lset applyto and prset applyto mechanism. For instance, assume that we have two partitions, a standard data partition (partition 1) and a nucleotide partition (partition 2), and want to apply a GTR model to the nucleotide data, gamma-shaped rate variation to both partitions, and allow the partition rates to be different. Then we would use the commands

 lset applyto=(2) nst=6;
 lset applyto=(all) rates=gamma;
 prset applyto=(all) ratepr=variable;

By default, all model parameters that are identical and have the same prior probability distribution associated with them, are linked across partitions (they are assumed to be one and the same parameter). To unlink parameters, use the unlink command. For instance, assume that we want to unlink the shape parameter across the partitions discussed above (after all, why should the standard data and the molecular data have the same distribution of rates across sites?). This would be achieved using

 unlink shape=(all);

If you unlink parameters by mistake, they can be linked again using the link command. All of the commands mentioned above and given as they would appear in a MrBayes block in a Nexus file can of course be entered from the command line as well (without the trailing semicolon). However, it is often more convenient to have them in either your data file or in a separate Nexus file that you process after you have read in your data. MrBayes will keep the data set in memory until you read in a new data block, so you can have your different MrBayes blocks pertaining to the same data file distributed over as many separate Nexus files as you like.

We recommend that, before you run your analysis, you check the current model settings using the showmodel command. This command will list all the active parameters and how they are linked across partitions, as well as the priors associated with each parameter.

Finally, we want to give you a warning. Even though MrBayes allows you to easily set up extremely complex and parameter-rich models, and the Bayesian MCMC approach is good at handling such models, think carefully about the parameters you introduce in your model. There should be at least some reasonable chance of estimating the parameters based on your data. For instance, a common mistake is to use a separate GTR model for a partition with so few substitutions that there is not a single observation for several rate categories. Making sure there are at least some observations allowing you to estimate each parameter is good practice. Over-parameterized models often result in problems with convergence in addition to the excessive variance seen in the parameter estimates.

Ancestral State Reconstruction

MrBayes allows you to infer ancestral states at ancestral nodes using the full hierarchical Bayesian approach (integrating out uncertainty concerning topology and other model parameters). The basic approach is described by Huelsenbeck and Bollback (2001) as well as in a recent review (Ronquist, 2004). You first need to constrain the node you want to infer ancestral states for using a constraint definition and the topologypr=constraints(...) command as described above for constrained topology models. Then ancestral state reconstruction is requested using report ancstates=yes.

The probability of each state will be printed to the .p file(s) under the heading p(<state_code>){character_number}. For instance, the probability of an A at site 215 in a nucleotide data set would be found under the heading p(A){215}. If you constrain several nodes in your data set, the node number will be given as well. If you had constrained two nodes, the probabilities of the above character would be distinguished as p(A){215@1} and p(A){215@2}. However, if you are interested in inferring ancestral states at two or more different nodes, we recommend running separate analyses, each constraining a single node. The reason is that when you focus on one node, you probably want to integrate over uncertainty in the rest of the tree, including the potential uncertainty concerning the presence of the other node(s).

Often, there is interest in mapping only one or a few characters onto trees inferred using largely other types of data. For instance, a behavioral or ecological trait may be mapped onto trees based on molecular data. To do this type of analysis in MrBayes, you would set up a mixed data set including both the character to be mapped and the data used to infer the phylogeny, with the character to be mapped in a separate data partition. How to do this is explained in the tutorial given in section 3 of this manual as well as in the description of partitioned models above. Typically, you also want to assume that the evolutionary rate for the mapped character is proportional to that of the other data (rather than identical). This is achieved by setting up a partitioned rate model using prset ratepr=variable. Then you need to set up a constraint for the node of interest, as described above. Finally, you request that ancestral states are inferred for the partition with the mapped character (there is no need to wade through ancestral state probabilities for the other partition(s)). For instance, if the character to be mapped is in partition 2, request ancestral state sampling using prset applyto=(2) ancstates=yes. Now only the ancestral states for the character of interest will be printed to the .p file(s). The sampled values can be summarized as usual with the sump command.

Personal tools