Tutorial 3.2

From MbWiki

Jump to: navigation, search


Tutorial: A Simple Analysis

This section is a tutorial based on the primates.nex data file. It will guide you through a basic Bayesian MCMC analysis of phylogeny, explaining the most important features of the program. There are two versions of the tutorial. You will first find a Quick-Start version for impatient users who want to get an analysis started immediately. The rest of the section contains a much more detailed description of the same analysis.

Quick Start Version

There are four steps to a typical Bayesian phylogenetic analysis using MrBayes:

  1. Read the Nexus data file
  2. Set the evolutionary model
  3. Run the analysis
  4. Summarize the samples

In more detail, each of these steps is performed as described in the following paragraphs.

1. At the MrBayes > prompt, type execute primates.nex. This will bring the data into the program. When you only give the data file name (primates.nex), the MrBayes program assumes that the file is in the current directory. If this is not the case, you have to use the full or relative path to your data file, for example execute ../taxa/primates.net. If you are running your own data file for this tutorial, beware that it may contain some MrBayes commands that can change the behavior of the program; delete those commands or put them in square brackets to follow this tutorial.

2. At the MrBayes > prompt, type lset nst=6 rates=invgamma. This sets the evolutionary model to the GTR model with gamma-distributed rate variation across sites and a proportion of invariable sites. If your data are not DNA or RNA, if you want to invoke a different model, or if you want to use non-default priors, refer to the rest of this manual, particularly sections 3 to 5 and the Appendix.

3.1. At the MrBayes > prompt, type mcmc ngen=10000 samplefreq=10. This will ensure that you get at least thousand samples from the posterior probability distribution. For larger data sets you probably want to run the analysis longer and sample less frequently (the default sampling frequency is every hundredth generation and the default numer of generations is one million). You can find the predicted remaining time to completion of the analysis in the last column printed to screen.

3.2. If the standard deviation of split frequencies is below 0.01 after 10,000 generations, stop the run by answering no when the program asks "Continue the analysis? (yes/no)". Otherwise, keep adding generations until the value falls below 0.01.

4.1. Summarize the parameter values by typing sump burnin=250 (or whatever value corresponds to 25 % of your samples). The program will output a table with summaries of the samples of the substitution model parameters, including the mean, mode, and 95 % credibility interval of each parameter. Make sure that the potential scale reduction factor (PSRF) is reasonably close to 1.0 for all parameters; if not, you need to run the analysis longer.

4.2. Summarize the trees by typing sumt burnin=250 (or whatever value corresponds to 25 % of your samples). The program will output a cladogram with the posterior probabilities for each split and a phylogram with mean branch lengths. The trees will also be printed to a file that can be read by tree drawing programs such as TreeView, MacClade, and Mesquite.

It does not have to be more complicated than this; however, as you get more proficient you will probably want to know more about what is happening behind the scenes. The rest of this section explains each of the steps in more detail and introduces you to all the implicit assumptions you are making and the machinery that MrBayes uses in order to perform your analysis.

Getting Data into MrBayes

To get data into MrBayes, you need a so-called Nexus file that contains aligned nucleotide or amino acid sequences, morphological ("standard") data, restriction site (binary) data, or any mix of these four data types. The Nexus file that we will use for this tutorial, primates.nex, contains 12 mitochondrial DNA sequences of primates. A Nexus file is a simple text (ASCII) file that begins with #NEXUS on the first line. The rest of the file is divided into different blocks. The primates.nex file looks like this:

begin data;
  dimensions ntax=12 nchar=898;
  format datatype=dna interleave=no gap=-;

The file contains only one block, a DATA block. The DATA block is initialized with begin data; followed by the dimensions statement, the format statement, and the matrix statement. The dimensions statement must contain ntax, the number of taxa, and nchar, the number of characters in each aligned sequence. The format statement must specify the type of data, for instance datatype=DNA (or RNA or Protein or Standard or Restriction). The format statement may also contain gap=- (or whatever symbol is used for a gap in your alignment), missing=? (or whatever symbol is used for missing data in your file), interleave=yes when the data matrix is interleaved sequences, and match=. (or whatever symbol is used for matching characters in the alignment). The format statement is followed by the matrix statement, usually started by the word matrix on a separate line, followed by the aligned sequences. Each sequence begins with the sequence name separated from the sequence itself by at least one space. The data block is completed by an end; statement. Note that the begin, dimensions, format, and end statements all end in a semicolon. That semicolon is essential and must not be left out. Note that, although it occupies many lines in the file, the matrix description is also a regular statement, a matrix statement, which ends with a semicolon just as any other statement. Our example file contains sequences in non-interleaved (block) format. Non-interleaved is the default but you can put interleave=no in the format statement if you want to be sure.

The Nexus data file can be generated by a program such as MacClade or Mesquite. Note, however, that MrBayes version 3 does not support the full Nexus standard, so you may have to do a little editing of the file for MrBayes to process it properly. In particular, MrBayes uses a fixed set of symbols for each data type and does not support the options of the upported symbols are {A, C, G, T, R, Y, M, K, S, W, H, B, V, D, N} for DNA data, {A, C, G, U, R, Y, M, K, S, W, H, B, V, D, N} for RNA data, {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, X} for protein data, {0, 1} for restriction (binary) data, and {0, 1, 2, 3, 4, 5, 6, 5, 7, 8, 9} for standard (morphology) data. In addition to the standard one-letter ambiguity symbols for DNA and RNA listed above, ambiguity can also be expressed using the Nexus parenthesis or curly braces notation. For instance, a taxon polymorphic for states 2 and 3 can be coded as (23), (2,3), {23}, or {2,3} and a taxon with either amino acid A or F can be coded as (AF), (A,F), {AF} or {A,F}. Like most other statistical phylogenetics programs, MrBayes effectively treats polymorphism and uncertainty the same way (as uncertainty), so it does not matter whether you use parentheses or curly brackets. If you have other symbols in your matrix than the ones supported by MrBayes, you need to replace them before processing the data block in MrBayes. You also need to remove the "Equate" and "Symbols" statements in the "Format" line if they are included. Unlike the Nexus standard, MrBayes supports data blocks that contain mixed data types as described in section 3 of this manual.

To put the data into MrBayes type execute <filename> at the MrBayes > prompt, where <filename> is the name of the input file. To process our example file, type execute primates.nex or simply exe primates.nex to save some typing. Note that the input file must be located in the same folder (directory) where you started the MrBayes application (or else you will have to give the path to the file) and the name of the input file should not have blank spaces. If everything proceeds normally, MrBayes will acknowledge that it has read the data in the DATA block of the Nexus file by outputting the following information:

  Executing file "primates.nex"
  UNIX line termination
  Longest line length = 915
  Parsing file
  Expecting NEXUS formatted file
  Reading data block
     Allocated matrix
     Matrix has 12 taxa and 898 characters
     Data is Dna
     Data matrix is not interleaved
     Gaps coded as -
     Taxon  1 -> Tarsius_syrichta
     Taxon  2 -> Lemur_catta
     Taxon  3 -> Homo_sapiens
     Taxon  4 -> Pan
     Taxon  5 -> Gorilla
     Taxon  6 -> Pongo
     Taxon  7 -> Hylobates
     Taxon  8 -> Macaca_fuscata
     Taxon  9 -> M_mulatta
     Taxon 10 -> M_fascicularis
     Taxon 11 -> M_sylvanus
     Taxon 12 -> Saimiri_sciureus
     Successfully read matrix
     Setting default partition (does not divide up characters)
     Setting model defaults
     Setting output file names to "primates.nex.<run<i>.p/run<i>.t>"
  Exiting data block
  Reached end of file

Specifying a Model

All of the commands are entered at the MrBayes > prompt. At a minimum two commands, lset and prset, are required to specify the evolutionary model that will be used in the analysis. Usually, it is also a good idea to check the model settings prior to the analysis using the showmodel command. In general, lset is used to define the structure of the model and prset is used to define the prior probability distributions on the parameters of the model. In the following, we will specify a GTR + I + _ model (a General Time Reversible model with a proportion of invariable sites and a gamma-shaped distribution of rates across sites) for the evolution of the mitochondrial sequences and we will check all of the relevant priors. If you are unfamiliar with stochastic models of molecular evolution, we suggest that you consult a general text, such as Felsenstein (2004).

In general, a good start is to type help lset. Ignore the help information for now and concentrate on the table at the bottom of the output, which specifies the current settings. It should look like this:

  Model settings for partition 1:

  Parameter    Options                               Current Setting
  Nucmodel     4by4/Doublet/Codon                    4by4
  Nst          1/2/6                                 1
  Code         Universal/Vertmt/Mycoplasma/
               Yeast/Ciliates/Metmt                  Universal
  Ploidy       Haploid/Diploid                       Diploid
  Rates        Equal/Gamma/Propinv/Invgamma/Adgamma  Equal
  Ngammacat    <number>                              4
  Usegibbs     Yes/No                                No                         
  Gibbsfreq    <number>                              100                       
  Nbetacat     <number>                              5
  Omegavar     Equal/Ny98/M3                         Equal
  Covarion     No/Yes                                No
  Coding       All/Variable/Noabsencesites/
               Nopresencesites                       All
  Parsmodel    No/Yes                                No

First, note that the table is headed by Model settings for partition 1. By default, MrBayes divides the data into one partition for each type of data you have in your DATA block. If you have only one type of data, all data will be in a single partition by default. How to change the partitioning of the data will be explained in section 3 of the manual.

The Nucmodel setting allows you to specify the general type of DNA model. The Doublet option is for the analysis of paired stem regions of ribosomal DNA and the Codon option is for analyzing the DNA sequence in terms of its codons. We will analyze the data using a standard nucleotide substitution model, in which case the default 4by4 option is appropriate, so we will leave Nucmodel at its default setting.

The general structure of the substitution model is determined by the Nst setting. By default, all substitutions have the same rate (Nst=1), corresponding to the F81 model (or the JC model if the stationary state frequencies are forced to be equal using the prset command, see below). We want the GTR model (Nst=6 instead of the F81 model so we type lset nst=6. MrBayes should acknowledge that it has changed the model settings.

The Code setting is only relevant if the Nucmodel is set to Codon. The Ploidy setting is also irrelevant for us. However, we need to change the Rates setting from the default Equal (no rate variation across sites) to Invgamma (gamma-shaped rate variation with a proportion of invariable sites). Do this by typing lset rates=invgamma. Again, MrBayes will acknowledge that it has changed the settings. We could have changed both lset settings at once if we had typed lset nst=6 rates=invgamma in a single line.

We will leave the Ngammacat setting (the number of discrete categories used to approximate the gamma distribution) at the default of 4. In most cases, four rate categories are sufficient. It is possible to increase the accuracy of the likelihood calculations by increasing the number of rate categories. However, the time it will take to complete the analysis will increase in direct proportion to the number of rate categories you use, and the effects on the results will be negligible in most cases.

The default behaviour for the discrete gamma model of rate variation across sites is to sum site probabilities across rate categories. To sample those probabilities using a Gibbs sampler, we can set the Usegibbs setting to Yes. The Gibbs sampling approach is much faster and requires less memory, but it has some implications you have to be aware of. This option and the Gibbsfreq option will be discussed in the next section of this manual.

Of the remaining settings, it is only Covarion and Parsmodel that are relevant for single nucleotide models. We will use neither the parsimony model nor the covariotide model for our data, so we will leave these settings at their default values. If you type help lset now to verify that the model is correctly set, the table should look like this:

  Model settings for partition 1:

  Parameter    Options                               Current Setting
  Nucmodel     4by4/Doublet/Codon                    4by4
  Nst          1/2/6                                 6
  Code         Universal/Vertmt/Mycoplasma/
               Yeast/Ciliates/Metmt                  Universal
  Ploidy       Haploid/Diploid                       Diploid
  Rates        Equal/Gamma/Propinv/Invgamma/Adgamma  Invgamma
  Ngammacat    <number>                              4
  Usegibbs     Yes/No                                No                         
  Gibbsfreq    <number>                              100                       
  Nbetacat     <number>                              5
  Omegavar     Equal/Ny98/M3                         Equal
  Covarion     No/Yes                                No
  Coding       All/Variable/Noabsencesites/
               Nopresencesites                       All
  Parsmodel    No/Yes                                No

Setting the Priors

We now need to set the priors for our model. There are six types of parameters in the model: the topology, the branch lengths, the four stationary frequencies of the nucleotides, the six different nucleotide substitution rates, the proportion of invariable sites, and the shape parameter of the gamma distribution of rate variation. The default priors in MrBayes work well for most analyses, and we will not change any of them for now. By typing help prset you can obtain a list of the default settings for the parameters in your model. The table at the end of the help information reads:

  Model settings for partition 1:

  Parameter        Options                      Current Setting
  Tratiopr         Beta/Fixed                   Beta(1.0,1.0)
  Revmatpr         Dirichlet/Fixed              Dirichlet(1.0,1.0,1.0,1.0,1.0,1.0)
  Aamodelpr        Fixed/Mixed                  Fixed(Poisson)
  Aarevmatpr       Dirichlet/Fixed              Dirichlet(1.0,1.0,...)
  Omegapr          Dirichlet/Fixed              Dirichlet(1.0,1.0)
  Ny98omega1pr     Beta/Fixed                   Beta(1.0,1.0)
  Ny98omega3pr     Uniform/Exponential/Fixed    Exponential(1.0)
  M3omegapr        Exponential/Fixed            Exponential
  Codoncatfreqs    Dirichlet/Fixed              Dirichlet(1.0,1.0,1.0)
  Statefreqpr      Dirichlet/Fixed              Dirichlet(1.0,1.0,1.0,1.0)
  Shapepr          Uniform/Exponential/Fixed    Uniform(0.0,200.0)
  Ratecorrpr       Uniform/Fixed                Uniform(-1.0,1.0)
  Pinvarpr         Uniform/Fixed                Uniform(0.0,1.0)
  Covswitchpr      Uniform/Exponential/Fixed    Uniform(0.0,100.0)
  Symdirihyperpr   Uniform/Exponential/Fixed    Fixed(Infinity)
  Topologypr       Uniform/Constraints          Uniform
  Brlenspr         Unconstrained/Clock          Unconstrained:Exp(10.0)
  Treeheightpr     Exponential/Gamma            Exponential(1.0)
  Speciationpr     Uniform/Exponential/Fixed    Uniform(0.0,10.0)
  Extinctionpr     Uniform/Exponential/Fixed    Uniform(0.0,10.0)
  Sampleprob       <number>                     1.00
  Thetapr          Uniform/Exponential/Fixed    Uniform(0.0,10.0)
  Nodeagepr        Unconstrained/Calibrated     Unconstrained
  Treeagepr        Fixed/Uniform/
                   Offsetexponential            Fixed(1.00)
  Clockratepr      Strict/Cpp/Bm                Strict
  Cppratepr        Fixed/Exponential            Exponential(0.10)
  Psigammapr       Fixed/Exponential/Uniform    Fixed(1.00)
  Nupr             Fixed/Exponential/Uniform    Fixed(1.00)
  Ratepr           Fixed/Variable=Dirichlet     Fixed

We need to focus on Revmatpr (for the six substitution rates of the GTR rate matrix), Statefreqpr (for the stationary nucleotide frequencies of the GTR rate matrix), Shapepr (for the shape parameter of the gamma distribution of rate variation), Pinvarpr (for the proportion of invariable sites), Topologypr (for the topology), and Brlenspr (for the branch lengths).

The default prior probability density is a flat Dirichlet (all values are 1.0) for both Revmatpr and Statefreqpr. This is appropriate if we want estimate these parameters from the data assuming no prior knowledge about their values. It is possible to fix the rates and nucleotide frequencies but this is generally not recommended. However, it is occasionally necessary to fix the nucleotide frequencies to be equal, for instance in specifying the JC and SYM models. This would be achieved by typing prset statefreqpr=fixed(equal).

If we wanted to specify a prior that put more emphasis on equal nucleotide frequencies than the default flat Dirichlet prior, we could for instance use prset statefreqpr = Dirichlet(10,10,10,10) or, for even more emphasis on equal frequencies, prset statefreqpr=Dirichlet(100,100,100,100). The sum of the numbers in the Dirichlet distribution determines how focused the distribution is, and the balance between the numbers determines the expected proportion of each nucleotide (in the order A, C, G, and T). Usually, there is a connection between the parameters in the Dirichlet distribution and the observations. For example, you can think of a Dirichlet (150,100,90,140) distribution as one arising from observing 150 A's, 100 C's, 90 G's and 140 T's in some set of reference sequences. If your set of sequences is independent of those reference sequences, but this reference set is clearly relevant to the analysis of your sequences, it might be reasonable to use those numbers as a prior in your analysis.

In our analysis, we will be cautious and leave the prior on state frequencies at its default setting. If you have changed the setting according to the suggestions above, you need to change it back by typing prset statefreqpr=Dirichlet(1,1,1,1) or prs st = Dir(1,1,1,1) if you want to save some typing. Similarly, we will leave the prior on the substitution rates at the default flat Dirichlet(1,1,1,1,1,1) distribution.

The Shapepr parameter determines the prior for the <math>\alpha</math>(shape) parameter of the gamma distribution of rate variation. We will leave it at its default setting, a uniform distribution spanning a wide range of <math>\alpha</math>values. The prior for the proportion of invariable sites is set with Pinvarpr. The default setting is a uniform distribution between 0 and 1, an appropriate setting if we don't want to assume any prior knowledge about the proportion of invariable sites.

For topology, the default Uniform setting for the Topologypr parameter puts equal probability on all distinct, fully resolved topologies. The alternative is to constrain some nodes in the tree to always be present but we will not attempt that in this analysis.

The Brlenspr parameter can either be set to unconstrained or clock-constrained. For trees without a molecular clock (unconstrained) the branch length prior can be set either to exponential or uniform. The default exponential prior with parameter 10.0 should work well for most analyses. It has an expectation of 1/10 = 0.1 but allows a wide range of branch length values (theoretically from 0 to infinity). Because the likelihood values vary much more rapidly for short branches than for long branches, an exponential prior on branch lengths is closer to being uninformative than a uniform prior.

Checking the Model

To check the model before we start the analysis, type showmodel. This will give an overview of the model settings. In our case, the output will be as follows:

  Model settings:

        Datatype  = DNA
        Nucmodel  = 4by4
        Nst       = 6
                    Substitution rates, expressed as proportions
                    of the rate sum, have a Dirichlet prior
        Covarion  = No
        # States  = 4
                    State frequencies have a Dirichlet prior
        Rates     = Invgamma
                    Gamma shape parameter is uniformly dist-
                    ributed on the interval (0.00,200.00).
                    Proportion of invariable sites is uniformly dist-
                    ributed on the interval (0.00,1.00).
                    Gamma distribution is approximated using 4 categories.
                    Likelihood summarized over all rate categories in each generation.

  Active parameters:

     Revmat           1
     Statefreq        2
     Shape            3
     Pinvar           4
     Topology         5
     Brlens           6

     1 --  Parameter  = Revmat
           Type       = Rates of reversible rate matrix
           Prior      = Dirichlet(1.00,1.00,1.00,1.00,1.00,1.00)
     2 --  Parameter  = Pi
           Type       = Stationary state frequencies
           Prior      = Dirichlet
     3 --  Parameter  = Alpha
           Type       = Shape of scaled gamma distribution of site rates
           Prior      = Uniform(0.00,200.00)
     4 --  Parameter  = Pinvar
           Type       = Proportion of invariable sites
           Prior      = Uniform(0.00,1.00)
     5 --  Parameter  = Tau
           Type       = Topology
           Prior      = All topologies equally probable a priori
           Subparam.  = V
     6 --  Parameter  = V
           Type       = Branch lengths
           Prior      = Unconstrained:Exponential(10.0)

Note that we have six types of parameters in our model. All of these parameters will be estimated during the analysis (see section 5 for information on how to fix parameter values). To see more information about each parameter including it's starting value, one can use the showparams command.

Setting up the Analysis

The analysis is started by issuing the mcmc command. However, before doing this, we recommend that you review the run settings by typing help mcmc. The help mcmc command will produce the following table at the bottom of the output:

  Parameter       Options               Current Setting
  Seed            <number>              456025217                                     
  Swapseed        <number>              1179134547                                     
  Ngen            <number>              1000000                                      
  Nruns           <number>              2                                      
  Nchains         <number>              4                                      
  Temp            <number>              0.200000                                     
  Reweight        <number>,<number>     0.00 v 0.00 ^                       
  Swapfreq        <number>              1                                      
  Nswaps          <number>              1                                      
  Samplefreq      <number>              100                                      
  Printfreq       <number>              100                                      
  Printall        Yes/No                Yes                                      
  Printmax        <number>              8                                      
  Mcmcdiagn       Yes/No                Yes                                     
  Diagnfreq       <number>              1000                                      
  Diagnstat       Avgstddev/Maxstddev   Avgstddev                                      
  Minpartfreq     <number>              0.20                                  
  Allchains       Yes/No                No                                     
  Allcomps        Yes/No                No                                     
  Relburnin       Yes/No                Yes                                     
  Burnin          <number>              0                                      
  Burninfrac      <number>              0.25                                  
  Stoprule        Yes/No                No                                     
  Stopval         <number>              0.05                                  
  Savetrees       Yes/No                No                                     
  Checkpoint      Yes/No                No                                     
  Checkfreq       <number>              100000                                      
  Filename        <name>                primates.nex.<p/t>
  Startparams     Current/Reset         Current                                      
  Starttree       Current/Random        Current                                      
  Nperts          <number>              0                                      
  Data            Yes/No                Yes                                     
  Ordertaxa       Yes/No                No                                     

The Seed is simply the seed for the random number generator, and Swapseed is the seed for the separate random number generator used to generate the chain swapping sequence (see below). Unless they are set to user-specified values, these seeds are generated from the system clock, so your values are likely to be different from the ones in the screen dump above. The Ngen setting is the number of generations for which the analysis will be run. It is useful to run a small number of generations first to make sure the analysis is correctly set up and to get an idea of how long it will take to complete a longer analysis. We will start with 10,000 generations. To change the Ngen setting without starting the analysis we use the mcmcp command, which is equivalent to mcmc except that it does not start the analysis. Type mcmcp ngen=10000 to set the number of generations to 10,000. You can type help mcmc to confirm that the setting was changed appropriately.

By default, MrBayes will run two simultaneous, completely independent analyses starting from different random trees (Nruns = 2). Running more than one analysis simultaneously is very helpful in determining when you have a good sample from the posterior probability distribution, so we suggest that you leave this setting as is. The idea is to start each run from different randomly chosen trees. In the early phases of the run, the two runs will sample very different trees but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar.

To make sure that MrBayes compares tree samples from the different runs, make sure that Mcmcdiagn is set to yes and that Diagnfreq is set to some reasonable value, such as every 1000th generation. MrBayes will now calculate various run diagnostics every Diagnfreq generation and print them to a file with the name <Filename>.mcmc. The most important diagnostic, a measure of the similarity of the tree samples in the different runs, will also be printed to screen every Diagnfreq generation. Every time the diagnostics are calculated, either a fixed number of samples (burnin) or a percentage of samples (burninfrac) from the beginning of the chain is discarded. The relburnin setting determines whether a fixed burnin (relburnin=no) or a burnin percentage (relburnin=yes) is used. If you , MrBayes will by default discard the first 25 % samples from the cold chain (relburnin=yes and burninfrac=0.25).

By default, MrBayes uses Metropolis coupling to improve the MCMC sampling of the target distribution. The Swapfreq, Nswaps, Nchains, and Temp settings together control the Metropolis coupling behavior. When Nchains is set to 1, no heating is used. When Nchains is set to a value n larger than 1, then n - 1 heated chains are used. By default, Nchains is set to 4, meaning that MrBayes will use 3 heated chains and one "cold" chain. In our experience, heating is essential for problems with more than about 50 taxa, whereas smaller problems often can be analyzed successfully without heating. Adding more than three heated chains may be helpful in analyzing large and difficult data sets. The time complexity of the analysis is directly proportional to the number of chains used (unless MrBayes runs out of physical RAM memory, in which case the analysis will suddenly become much slower), but the cold and heated chains can be distributed among processors in a cluster of computers using the MPI version of the program (see below), greatly speeding up the calculations.

MrBayes uses an incremental heating scheme, in which chain <math>i</math> is heated by raising its posterior probability to the power <math>1/ (1 + i\lambda</math>), where <math>\lambda</math> is the temperature controlled by the Temp parameter. The effect of the heating is to flatten out the posterior probability, such that the heated chains more easily find isolated peaks in the posterior distribution and can help the cold chain move more rapidly between these peaks. Every Swapfreq generation, two chains are picked at random and an attempt is made to swap their states. For many analyses, the default settings should work nicely. If you are running many more than three heated chains, however, you may want to increase the number of swaps (Nswaps) that are tried each time the chain stops for swapping.

The Samplefreq setting determines how often the chain is sampled. By default, the chain is sampled every 100th generation, and this works well for most analyses. However, our analysis is so small that we are likely to get convergence quickly, so it makes sense to sample the chain more frequently, say every 10th generation (this will ensure that we get at least 1,000 samples when the number of generations is set to 10,000). To change the sampling frequency, type mcmcp samplefreq=10.

When the chain is sampled, the current values of the model parameters are printed to file. The substitution model parameters are printed to a .p file (in our case, there will be one file for each independent analysis, and they will be called primates.nex.run1.p and primates.nex.run2.p). The .p files are tab delimited text files that can be imported into most statistics and graphing programs. The topology and branch lengths are printed to a .t file (in our case, there will be two files called primates.nex.run1.t and primates.nex.run2.t). The .t files are Nexus tree files that can be imported into programs like PAUP* and TreeView. The root of the .p and .t file names can be altered using the Filename setting.

The Printfreq parameter controls the frequency with which the state of the chains is printed to screen. Leave Printfreq at the default value (print to screen every 100th generation).

The default behavior of MrBayes is to save trees with branch lengths to the .t file. Since this is what we want, we leave this setting as is. If you are running a large analysis (many taxa) and are not interested in branch lengths, you can save a considerable amount of disk space by not saving branch lengths.

The Startingtree parameter can be used to feed the chain(s) with a user-specified starting tree. The default behavior is to start each chain with a different random tree; this is recommended for general use.

Running the Analysis

Finally, we are ready to start the analysis. Type mcmc. MrBayes will first print information about the model and then list the proposal mechanisms that will be used in sampling from the posterior distribution. In our case, the proposals are the following:

  The MCMC sampler will use the following moves:
     With prob.  Chain will change
        5.26 %   param. 1 (Revmat) with Dirichlet proposal
        5.26 %   param. 2 (Pi) with Dirichlet proposal
        5.26 %   param. 3 (Alpha) with Multiplier
        5.26 %   param. 4 (Pinvar) with Sliding window
       78.95 %   param. 5 (Tau) and 6 (V) with Extending TBR

Note that MrBayes will spend most of its effort changing topology (Tau) and branch lengths (V) parameters. In our experience, topology and branch lengths are the most difficult parameters to integrate over and we therefore let MrBayes spend a large proportion of its time proposing new values for those parameters. The proposal probabilities can be changed with the propset command, but be warned that inappropriate changes of proposal probabilities may destroy any hopes of achieving convergence. After the initial log likelihoods, MrBayes will print the state of the chains every 100th generation, like this:

  Chain results:
      1 -- [-5723.498] (-5729.634) (-5727.207) (-5731.104) * [-5721.779] (-5731.701) (-5737.807) (-5730.336) 
    100 -- (-5726.662) (-5728.374) (-5733.144) [-5722.257] * [-5721.199] (-5726.193) (-5732.098) (-5732.563) -- 0:03:18
    200 -- [-5729.666] (-5721.116) (-5731.222) (-5731.546) * (-5726.632) [-5731.803] (-5738.420) (-5729.889) -- 0:02:27
    300 -- [-5727.654] (-5725.420) (-5736.655) (-5725.982) * (-5722.774) (-5743.637) (-5729.989) [-5729.954] -- 0:02:09
    400 -- [-5728.809] (-5722.467) (-5742.752) (-5729.874) * (-5723.731) (-5739.025) [-5719.889] (-5731.096) -- 0:02:24
    500 -- [-5728.286] (-5723.060) (-5738.274) (-5726.420) * [-5724.408] (-5733.188) (-5719.771) (-5725.882) -- 0:02:13
    600 -- [-5719.082] (-5728.268) (-5728.040) (-5731.023) * (-5727.788) (-5733.390) [-5723.994] (-5721.954) -- 0:02:05
    700 -- [-5717.720] (-5725.982) (-5728.786) (-5732.380) * (-5722.842) (-5727.218) [-5720.717] (-5729.936) -- 0:01:59
    800 -- (-5725.531) (-5729.259) (-5743.762) [-5731.019] * (-5729.238) [-5731.272] (-5722.135) (-5727.906) -- 0:02:06
    900 -- [-5721.976] (-5725.464) (-5731.774) (-5725.830) * (-5727.845) [-5723.992] (-5731.020) (-5728.988) -- 0:02:01
   1000 -- (-5724.549) [-5723.807] (-5726.810) (-5727.921) * (-5729.302) [-5730.518] (-5733.236) (-5727.348) -- 0:02:06
  Average standard deviation of split frequencies: 0.000000

   1100 -- [-5724.473] (-5726.013) (-5723.995) (-5724.521) * (-5734.206) (-5720.464) [-5727.936] (-5723.821) -- 0:02:01
   9000 -- (-5741.070) (-5728.937) (-5738.787) [-5719.056] * (-5731.562) [-5722.514] (-5721.184) (-5731.386) -- 0:00:13
  Average standard deviation of split frequencies: 0.000116
   9100 -- (-5747.669) [-5726.528] (-5738.190) (-5725.938) * (-5723.844) (-5726.963) [-5723.221] (-5724.665) -- 0:00:11
   9200 -- (-5738.994) (-5725.611) (-5734.902) [-5723.275] * [-5718.420] (-5724.197) (-5730.129) (-5724.800) -- 0:00:10
   9300 -- (-5740.946) (-5728.599) [-5729.193] (-5731.202) * (-5722.247) [-5723.141] (-5729.026) (-5727.039) -- 0:00:09
   9400 -- (-5735.178) (-5726.517) [-5726.557] (-5728.377) * (-5721.659) (-5723.202) (-5734.709) [-5726.191] -- 0:00:07
   9500 -- (-5731.041) (-5730.340) [-5721.900] (-5730.002) * (-5724.353) [-5727.075] (-5735.553) (-5725.420) -- 0:00:06
   9600 -- [-5726.318] (-5737.300) (-5725.160) (-5731.890) * (-5721.767) [-5730.250] (-5742.843) (-5725.866) -- 0:00:05
   9700 -- [-5726.573] (-5735.158) (-5728.509) (-5724.753) * (-5722.873) [-5729.740] (-5744.456) (-5723.282) -- 0:00:03
   9800 -- (-5728.167) (-5736.140) (-5729.682) [-5725.419] * (-5723.056) (-5726.630) (-5729.571) [-5720.712] -- 0:00:02
   9900 -- (-5738.486) (-5737.588) [-5732.250] (-5728.228) * (-5726.533) (-5733.696) (-5724.557) [-5722.960] -- 0:00:01
  10000 -- (-5729.797) (-5725.507) (-5727.468) [-5720.465] * (-5729.313) (-5735.121) (-5722.913) [-5726.844] -- 0:00:00
  Average standard deviation of split frequencies: 0.000105
  Continue with analysis? (yes/no): 

If you have the terminal window wide enough, each generation of the chain will print on a single line.

The first column lists the generation number. The following four columns with negative numbers each correspond to one chain in the first run. Each column corresponds to one physical location in computer memory, and the chains actually shift positions in the columns as the run proceeds. The numbers are the log likelihood values of the chains. The chain that is currently the cold chain has its value surrounded by square brackets, whereas the heated chains have their values surrounded by parentheses. When two chains successfully change states, they trade column positions (places in computer memory). If the Metropolis coupling works well, the cold chain should move around among the columns; this means that the cold chain successfully swaps states with the heated chains. If the cold chain gets stuck in one of the columns, then the heated chains are not successfully contributing states to the cold chain, and the Metropolis coupling is inefficient. The analysis may then have to be run longer or the temperature difference between chains may have to be lowered.

The star column separates the two different runs. The last column gives the time left to completion of the specified number of generations. This analysis approximately takes 1 second per 100 generations. Because different moves are used in each generation, the exact time varies somewhat for each set of 100 generations, and the predicted time to completion will be unstable in the beginning of the run. After a while, the predictions will become more accurate and the time will decrease predictably.

When to Stop the Analysis

At the end of the run, MrBayes asks whether or not you want to continue with the analysis. Before answering that question, examine the average standard deviation of split frequencies. As the two runs converge onto the stationary distribution, we expect the average standard deviation of split frequencies to approach zero, reflecting the fact that the two tree samples become increasingly similar. In our case, the average standard deviation is about 0.07 after 1,000 generations and then drops to less than 0.000001 towards the end of the run. Your values can differ slightly because of stochastic effects. Given the extremely low value of the average standard deviation at the end of the run, there appears to be no need to continue the analysis beyond 10,000 generations so when MrBayes asks "Continue with analysis? (yes/no):", stop the analysis by typing no.

Although we recommend using a convergence diagnostic, such as the standard deviation of split frequencies, to determine run length, we want to mention that there are simpler but less powerful methods of determining when to stop the analysis. Arguably the simplest technique is to examine the log likelihood values (or, more exactly, the log probability of the data given the parameter values) of the cold chain, that is, the values printed to screen within square brackets. In the beginning of the run, the values typically increase rapidly (the absolute values decrease, since these are negative numbers). This phase of the run is referred to as the "burn-in" and the samples from this phase are typically discarded. Once the likelihood of the cold chain stops to increase and starts to randomly fluctuate within a more or less stable range, the run may have reached stationarity, that is, it is producing a good sample from the posterior probability distribution. At stationarity, we also expect different, independent runs to sample similar likelihood values. Trends in likelihood values can be deceiving though; you're more likely to detect problems with convergence by comparing split frequencies than by looking at likelihood trends.

When you stop the analysis, MrBayes will print several types of information useful in optimizing the analysis. This is primarily of interest if you have difficulties in obtaining convergence. Since we apparently have a good sample from the posterior distribution already, we will ignore this information for now. We will return to the subject of optimizing the MCMC analysis in [[[FAQ 3.2|section 5]] of the manual.

Summarizing Samples of Substitution Model Parameters

During the run, samples of the substitution model parameters have been written to the .p files every samplefreq generation. These files are tab-delimited text files that look something like this:

 [ID: 9409050143]
 Gen     LnL        TL      r(A<->C)  ...  pi(G)    pi(T)     alpha     pinvar  
 1       -5723.498  3.357   0.067486  ...  0.098794 0.247609  0.580820  0.124136
 10      -5727.478  3.110   0.030604  ...  0.072965 0.263017  0.385311  0.045880
 9990    -5727.775  2.687   0.052292  ...  0.086991 0.224332  0.951843  0.228343
 10000   -5720.465  3.290   0.038259  ...  0.076770 0.240826  0.444826  0.087738

The first number, in square brackets, is a randomly generated ID number that lets you identify the analysis from which the samples come. The next line contains the column headers, and is followed by the sampled values. From left to right, the columns contain: (1) the generation number (Gen); (2) the log likelihood of the cold chain (LnL); (3) the total tree length (the sum of all branch lengths, TL); (4) the six GTR rate parameters (r(A<->C), r(A<->G) etc); (5) the four stationary nucleotide frequencies (pi(A), pi(C) etc); (6) the shape parameter of the gamma distribution of rate variation (alpha); and (7) the proportion of invariable sites (pinvar). If you use a different model for your data set, the .p files will of course be different.

MrBayes provides the sump command to summarize the sampled parameter values. Before using it, we need to decide on the burn-in. Since the convergence diagnostic we used previously to determine when to stop the analysis discarded the first 25 % of the samples and indicated that convergence had been reached after 10,000 generations, it makes sense to discard 25 % of the samples obtained during the first 10,000 generations. Since we sampled every 10th generation, there are 1,000 samples (1,001 to be exact, since the first generation is always sampled) and 25 % translates to 250 samples. Thus, summarize the information in the .p file by typing sump burnin=250. By default, sump will summarize the information in the .p file generated most recently, but the filename can be changed if necessary.

The sump command will first generate a plot of the generation versus the log probability of the data (the log likelihood values). If we are at stationarity, this plot should look like "white noise", that is, there should be no tendency of increase or decrease over time. The plot should look something like this:

  +------------------------------------------------------------+ -5718.96
  |                 2                           12             |
  |            2                                               |
  |              1 2         2                        2        |
  |             1    22    1*1   2 22        2   1      2      |
  |     2     2 2             1 2 2     2     2        1       |
  |       11      1      2        1  2   2 2           2 2   2 |
  |   1  2  1       1    12    1 1    *   2       2            |
  |  1 11 2  *    2       1        1        2  2      1       *|
  | *2      2      1   22              2  1        211    22   |
  |        2   1                    1  1    11 1    22  1      |
  |*     1       2            2      1        1 2          1*  |
  |                            2        1          1      1    |
  |           1       111  2    1        1 1                   |
  |   22             1                                       1 |
  |                                               1      1     |
  +------+-----+-----+-----+-----+-----+-----+-----+-----+-----+ -5729.82
  ^                                                            ^
  2500                                                         10000

If you see an obvious trend in your plot, either increasing or decreasing, you probably need to run the analysis longer to get an adequate sample from the posterior probability distribution.

At the bottom of the sump output, there is a table summarizing the samples of the parameter values:

  Model parameter summaries over the runs sampled in files
     "primates.nex.run1.p" and "primates.nex.run2.p":
     (Summaries are based on a total of 1502 samples from 2 runs)
     (Each run produced 1001 samples of which 751 samples were included)
                                             95% Cred. Interval
  Parameter       Mean        Variance       Lower         Upper         Median       PSRF *
  TL            2.954334      0.069985      2.513000      3.558000      2.941000      1.242
  r(A<->C)      0.044996      0.000060      0.030878      0.059621      0.044567      1.016
  r(A<->G)      0.470234      0.002062      0.386927      0.557040      0.468758      1.025
  r(A<->T)      0.038107      0.000073      0.023568      0.056342      0.037172      1.022
  r(C<->G)      0.030216      0.000189      0.007858      0.058238      0.028350      1.001
  r(C<->T)      0.396938      0.001675      0.317253      0.476998      0.394980      1.052
  r(G<->T)      0.019509      0.000158      0.001717      0.047406      0.018132      1.003
  pi(A)         0.355551      0.000150      0.332409      0.382524      0.357231      1.010
  pi(C)         0.320464      0.000131      0.298068      0.343881      0.320658      0.999
  pi(G)         0.081290      0.000043      0.067120      0.095940      0.080521      1.004
  pi(T)         0.242695      0.000101      0.220020      0.261507      0.243742      1.030
  alpha         0.608305      0.042592      0.370790      1.142317      0.546609      1.021
  pinvar        0.135134      0.007374      0.008146      0.303390      0.126146      0.999
  * Convergence diagnostic (PSRF = Potential scale reduction factor [Gelman
    and Rubin, 1992], uncorrected) should approach 1 as runs converge. The
    values may be unreliable if you have a small number of samples. PSRF should
    only be used as a rough guide to convergence since all the assumptions
    that allow one to interpret it as a scale reduction factor are not met in
    the phylogenetic context.

For each parameter, the table lists the mean and variance of the sampled values, the lower and upper boundaries of the 95 % credibility interval, and the median of the sampled values. The parameters are the same as those listed in the .p files: the total tree length (TL), the six reversible substitution rates (r(A<->C), r(A<->G), etc), the four stationary state frequencies (pi(A), pi(C), etc), the shape of the gamma distribution of rate variation across sites (alpha), and the proportion of invariable sites (pinvar). Note that the six rate parameters of the GTR model are given as proportions of the rate sum (the Dirichlet parameterization). This parameterization has some advantages in the Bayesian context; in particular, it allows convenient formulation of priors. If you want to scale the rates relative to the G-T rate, just divide all rate proportions by the G-T rate proportion.

The last column in the table contains a convergence diagnostic, the Potential Scale Reduction Factor (PSRF). If we have a good sample from the posterior probability distribution, these values should be close to 1.0. If you have a small number of samples, there may be some spread in these values, indicating that you may need to sample the analysis more often or run it longer. In our case, we can probably easily obtain more accurate estimates of some parameters by running the analysis slightly longer.

Summarizing Samples of Trees and Branch Lengths

Trees and branch lengths are printed to the .t files. These files are Nexus-formatted tree files with a structure like this:

 [ID: 9409050143]
 [Param: tree]
 begin trees;
      1 Tarsius_syrichta,
      2 Lemur_catta,
      3 Homo_sapiens,
      4 Pan,
      5 Gorilla,
      6 Pongo,
      7 Hylobates,
      8 Macaca_fuscata,
      9 M_mulatta,
     10 M_fascicularis,
     11 M_sylvanus,
     12 Saimiri_sciureus;
  tree rep.1 = ((12:0.486148,(((((3:0.042011,4:0.065025):0.034344,5:0.051939):0.079843,6:0.154472):0.071622,7:0.214803):0.140045,(11:0.047748,(10:0.062613,(8:0.011963,9:0.027989):0.039141):0.076611):0.320446):0.054798):0.348505,2:0.407786,1:0.619434);
  tree rep.10000 = (((((10:0.087647,(8:0.013447,9:0.021186):0.030524):0.056885,11:0.058483):0.284214,(7:0.162145,(((4:0.074703,3:0.054709):0.038621,5:0.079920):0.092362,6:0.147502):0.042049):0.145813):0.061064,12:0.550171):0.398401,2:0.386189,1:0.504229);

To summarize the tree and branch length information, type sumt burnin=250. The sumt and sump commands each have separate burn-in settings so it is necessary to give the burn-in here again. Otherwise, many of the settings in MrBayes are persistent and need not be repeated every time a command is executed. To make sure the settings for a particular command are correct, you can always use help <command> before issuing the command.

The sumt command will output, among other things, summary statistics for the taxon bipartitions, a tree with clade credibility (posterior probability) values, and a phylogram (if branch lengths have been saved). The summary statistics (see below) describes each partition in the rmat (dots for the taxa that are on one side of the partition and stars for the taxa on the other side; for instance, the sixth partition (ID 6) is the terminal branch leading to taxon 2 since it has a star in the second position and a dot in all other positions). Then it gives the number of times the partition was sampled (#obs), the probability of the partition (Probab.), the standard deviation of the partition frequency (Stdev(s)), the mean (Mean(v)) and variance (Var(v)) of the branch length, the Potential Scale Reduction Factor (PSRF), and finally the number of runs in which the partition was sampled (Nruns). In our analysis, there is overwhelming support for a single tree, so all partitions have a posterior probability of 1.0.

  Summary statistics for taxon bipartitions:
  ID -- Partition     #obs   Probab.  Stdev(s)   Mean(v)   Var(v)    PSRF  Nruns
   1 -- .......**...  1502  1.000000  0.000000  0.035937  0.000083  1.000    2
   2 -- .........*..  1502  1.000000  0.000000  0.056738  0.000148  1.006    2
   3 -- ........*...  1502  1.000000  0.000000  0.022145  0.000037  1.077    2
   4 -- ..........*.  1502  1.000000  0.000000  0.072380  0.000338  1.007    2
   5 -- .......*....  1502  1.000000  0.000000  0.017306  0.000037  1.036    2
   6 -- .*..........  1502  1.000000  0.000000  0.345552  0.003943  1.066    2
   7 -- .***********  1502  1.000000  0.000000  0.496361  0.006726  1.152    2
   8 -- ..**********  1502  1.000000  0.000000  0.273113  0.003798  1.021    2
   9 -- .......***..  1502  1.000000  0.000000  0.045900  0.000315  1.002    2
  10 -- .......****.  1502  1.000000  0.000000  0.258660  0.002329  1.041    2
  11 -- ..*.........  1502  1.000000  0.000000  0.049774  0.000110  1.014    2
  12 -- ...*........  1502  1.000000  0.000000  0.062863  0.000147  1.000    2
  13 -- .....*......  1502  1.000000  0.000000  0.146137  0.000665  1.060    2
  14 -- ...........*  1502  1.000000  0.000000  0.430463  0.004978  1.045    2
  15 -- ......*.....  1502  1.000000  0.000000  0.173405  0.000940  1.053    2
  16 -- ..***.......  1502  1.000000  0.000000  0.080733  0.000375  1.023    2
  17 -- ..****......  1502  1.000000  0.000000  0.055286  0.000409  1.064    2
  18 -- ..*****.....  1502  1.000000  0.000000  0.116993  0.001254  1.046    2
  19 -- ....*.......  1502  1.000000  0.000000  0.059082  0.000219  1.014    2
  20 -- ..*********.  1501  0.999334  0.000942  0.124653  0.001793  1.141    2
  21 -- ..**........  1500  0.998668  0.000000  0.030905  0.000135  1.030    2

The clade credibility tree (upper tree, next page) gives the probability of each partition or clade in the tree, and the phylogram (lower tree, next page) gives the branch lengths measured in expected substitutions per site.

  Clade credibility values:
  /--------------------------------------------------------- Tarsius_syrichta (1)
  |--------------------------------------------------------- Lemur_catta (2)
  |                                                /-------- Homo_sapiens (3)
  |                                        /--100--+                              
  |                                        |       \-------- Pan (4)
  |                                /--100--+                                      
  |                                |       \---------------- Gorilla (5)
  |                       /---100--+                                              
  +                       |        \------------------------ Pongo (6)
  |               /--100--+                                                       
  |               |       \--------------------------------- Hylobates (7)
  |               |                                                               
  |               |                                /-------- Macaca_fuscata (8)
  |       /--100--+                        /--100--+                              
  |       |       |                        |       \-------- M_mulatta (9)
  |       |       |                /--100--+                                      
  |       |       |                |       \---------------- M_fascicularis (10)
  \--100--+       \-------100------+                                              
          |                        \------------------------ M_sylvanus (11)
          \------------------------------------------------- Saimiri_sciureus (12)
  Phylogram (based on average branch lengths):
  /--------------------------------------- Tarsius_syrichta (1)
  |--------------------------- Lemur_catta (2)
  |                                                    /---- Homo_sapiens (3)
  |                                                  /-+                          
  |                                                  | \----- Pan (4)
  |                                           /------+                            
  |                                           |      \---- Gorilla (5)
  |                                       /---+                                   
  +                                       |   \------------ Pongo (6)
  |                              /--------+                                       
  |                              |        \------------- Hylobates (7)
  |                              |                                                
  |                              |                         /-- Macaca_fuscata (8)
  |                    /---------+                       /-+                      
  |                    |         |                       | \-- M_mulatta (9)
  |                    |         |                   /---+                        
  |                    |         |                   |   \---- M_fascicularis (10)
  \--------------------+         \-------------------+                            
                       |                             \------ M_sylvanus (11)
                       \---------------------------------- Saimiri_sciureus (12)
  |--------------| 0.200 expected changes per site

In the background, the sumt command creates three additional files. The first is a .parts file, which contains the list of taxon bipartitions, their posterior probability (the proportion of sampled trees containing them), and the branch lengths associated with them (if branch lengths have been saved). The branch length values are based only on those trees containing the relevant bipartition. The second generated file has the suffix .con and includes two consensus trees. The first one has both the posterior probability of clades (as interior node labels) and the branch lengths (if they have been saved) in its description. A graphical representation of this tree can be generated in the program TreeView by Rod Page. The second tree only contains the branch lengths and it can be imported into a wide range of tree-drawing programs such as MacClade and Mesquite. The third file generated by the sumt command is the .trprobs file, which contains the trees that were found during the MCMC search, sorted by posterior probability.

Personal tools