Advanced Topics

This is the manual for version 3.1. Please use the new version.

Compiling MrBayes

The MrBayes executable can be compiled from the source code with several different compilers, targeting all the common operating systems: Macintosh, Windows, and Unix. The easiest way to build MrBayes is to use the included Makefile with a make tool. One can also compile MrBayes with the Metrowerks CodeWarrior and Microsoft Visual Studio suites.

Compiling with GNU Make

In the header of the Makefile, you can define a number of variables:

ARCHITECTURE. This variable defines the architecture you are targeting. Setting this variable is mandatory. For example, for a Unix environment you would use ARCHITECTURE = unix. Other options are windows and mac.

MPI. Set this variable to yes if you want to compile the parallel (MPI) version of MrBayes. This variable is set to no by default, meaning that a serial version of MrBayes will be built. See below for more information on how to compile and run the MPI version of MrBayes.

CC. This variable defines which compiler to use. For example, gcc for the GNU compiler or icc for the Intel C compiler. The default setting is the GNU compiler.

DEBUG. Set this variable to yes if you want to compile a debug version of MrBayes. This adds the appropriate flag for the GNU gdb debugger.

OPTFLAGS. Sets the optimization flags for the compiler. This option is ignored if DEBUG is set to yes. The default is set to -O3, which yields good results for every platform. It is, however, possible to perform some tuning with this variable. We give a few possibilities below for some common processor types, assuming you are using gcc version 3. See the gcc manual for further information on optimization flags.

  Intel x86. Some compiler flags for gcc under Unix and for gcc/Cygwin under Windows:
    -march=X, with X one of pentium4, athlon-xp or opteron. If you have one of these processors, this will generate instructions specifically tailored for that processor.
    -mfpmath=sse attempts to use the SSE extension for numerical calculations. This flag is only effective in combination with the above-mentioned -march flag. This flag can provide a big performance gain. However, using this flag in combination with other optimization flags might yield numerically incorrect code. For example, one can set -mfpmath=sse,386, but this flag leads to incorrect results when used in combination with -march=pentium4.
    -fomit-frame-pointer saves some function overhead.
    -O3 instead of -O2 turns on even more optimization flags. However, it does not always produce faster code than -O2.
  Mac G4 and G5. Some compiler flags for gcc on OS X:
    -fast. This flag is specific to the gcc version delivered by Apple. It turns on a set of compiler flags that try to optimize for maximum performance. This is the recommended setting if you have a G5 processor and this version of gcc. Code compiled with this flag will not run on a G4 processor.
    -O2 or -O3.
    -mcpu=X, with X one of G4 or G5.
  Setting -mcpu or -fast on the Mac results in gcc enabling a number of different flags. Read the gcc manual carefully if you want to experiment with other flags.

Compiling with CodeWarrior or Visual Studio

We provide MrBayes project files for both Metrowerks CodeWarrior and Microsoft Visual Studio in the source code package. All the relevant flags are set in these files, so you should be able to compile the code without any further modifications.

Compiling and Running the Parallel Version of MrBayes

Metropolis coupling or heating is well suited for parallelization. MrBayes 3 takes advantage of this and uses MPI to distribute heated and cold chains among available processors (Altekar et al., 2004). There are two MPI versions of MrBayes. The first is the parallel version for Macintosh computers distributed as part of the Macintosh package. It is intended for use on clusters of Macintosh computers and runs under POOCH, which must be installed first. The second MPI version of MrBayes is intended for use on clusters running UNIX and must be compiled from the source code.
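To make the idea of distributing chains among processors concrete, here is a minimal Python sketch. It is not taken from the MrBayes source: the function name assign_chains, the round-robin scheme, and the rule that there must be at least as many chains as processes are all assumptions made for illustration only.

```python
def assign_chains(n_chains, n_procs):
    """Hypothetical round-robin layout: chain k goes to process k % n_procs."""
    if n_procs < 1:
        raise ValueError("need at least one process")
    if n_procs > n_chains:
        # Assumption for this sketch: a process without a chain has nothing to do.
        raise ValueError("more processes than chains")
    layout = [[] for _ in range(n_procs)]
    for chain in range(n_chains):
        layout[chain % n_procs].append(chain)
    return layout


# Eight chains (cold and heated together) spread over four processes.
for rank, chains in enumerate(assign_chains(8, 4)):
    print(f"process {rank}: chains {chains}")
```

Under this layout each process carries the same number of chains whenever the chain count is a multiple of the process count, which keeps the work per process balanced.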

The Parallel Macintosh Version

There are several options available for running jobs in parallel on clusters of Macintosh computers. For example, in OS X, you could configure your machine to run jobs using MPICH or LAM/MPI and then compile the regular Unix MPI version of the program as described in the next section. Another method is to use Dean Dauger's program Pooch (available at www.daugerresearch.com/pooch/whatis.html) to control the jobs. The Pooch web site gives a good description of the steps required to run a job in parallel. The steps are as follows:

  1. Configure a network of Macintosh computers. You have probably already done this step! You simply need more than one computer hooked to the internet.
  2. Buy and install a copy of Pooch for each computer you intend to run MrBayes on.
  3. Start Pooch on all of the computers of your cluster. If you set Pooch to automatically start on login, then this has already been done.
  4. Select 'New Job' from Pooch on one of the computers.
  5. Select the nodes (computers) you want to participate in the parallel job.
  6. Drag the 'MrBayes3.1.1p' application and the nexus file you wish to run to Pooch's Job Window.
  7. Click the 'Launch Job' button.

From this point on, MrBayes behaves just like the serial version of the program.

The MPI Version for Unix Clusters

The MPI version for Unix clusters, including Xserve clusters, has to be compiled before you can run it. To tell the compiler that you want the MPI version, you need to change a line in the top section (the configuration section) of the Makefile. The line originally reads:

MPI = no

Change this to:

MPI = yes

If your system is set up correctly (among other things, mpicc and the relevant MPI libraries must be installed), you should now be able to compile the MPI version of MrBayes. After the Makefile has been edited appropriately, a typical make session looks as follows:

 [ronquist@petal036 ~/mpi_mbdev]$ make
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized -c bayes.c
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized -c command.c
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized -c mbmath.c
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized -c mcmc.c
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized -c model.c
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized -c plot.c
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized -c sump.c
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized -c sumt.c
 mpicc -DUNIX_VERSION -DMPI_ENABLED -O3 -Wall -Wno-uninitialized bayes.o command.o mbmath.o mcmc.o model.o plot.o sump.o sumt.o -lm -o mb

and produces an MPI-enabled version of MrBayes called mb. Make sure that the mpicc compiler is invoked and that the MPI_ENABLED flag is set. It is perfectly normal if the build process stops for a few minutes on the mcmc.c file; this is the largest source file and it takes the compiler some time to optimize the code. How you run the resulting executable depends on the MPI implementation on your cluster. At FSU we typically run MrBayes using LAM/MPI. First, the LAM virtual machine is set up as usual with lamboot. Then the parallel MrBayes job is started with a line such as

$ mpirun -np 4 mb batch.nex > log.txt &

to have MrBayes process the file batch.nex and run all analyses on four processors (-np 4), saving screen output to the file log.txt. If you keep both a serial and a parallel version of MrBayes on your system, make sure you are using the parallel version with your mpirun command.

Working with the Source Code

MrBayes 3 is written entirely in ANSI C. If you are interested in investigating or working with the source code, you can download the latest (bleeding edge) version from the MrBayes CVS repository at SourceForge. You can access the CVS repository from the MrBayes home page at SourceForge (http://sourceforge.net/projects/mrbayes). SourceForge gives detailed instructions for anonymous access to the CVS repository on their documentation pages.

If you are interested in contributing code with bug fixes, the best way is to send a diff with respect to the most recent file versions in the CVS repository to Paul van der Mark (paulvdm[at]csit.fsu.edu), and we will include your fixes in the main development branch as soon as possible. If you would like to add functionality to MrBayes or improve some of the algorithms, please contact Paul for directions before you start any extensive work on your project to make sure your additions will be compatible with other ongoing development activities. You should also consider whether you want to work with version 3 or version 4 of the program. We are currently shifting our focus to the development of MrBayes 4. Unlike version 3, which is written in C, this version will be written in C++ and our goal is to provide a cleaner, faster, and more extensively documented implementation of Bayesian MCMC phylogenetic analysis. This means, among other things, that the code will be better organized, and all important sections will be documented using Doxygen for easy access to other developers. You are welcome to examine this project as it develops by downloading the source code, doxygen documentation, or programming style directives from the MrBayes CVS repository at SourceForge.

Advanced Options

LSet UseGibbs Option

As described in the Gamma-distributed rates section, MrBayes can accommodate rate heterogeneity across sites using a discrete approximation to the gamma distribution. The discrete approximation to the gamma distribution is a form of hidden Markov model, in which there are ngammacat rate categories that any site can belong to. There is a "hidden" state that identifies the appropriate rate category for each site. Because we cannot see the rate category for a site, we must treat it as an unknown variable. The LSet UseGibbs option controls how this variable is handled.

When LSet UseGibbs=no is in effect, the likelihoods calculated by MrBayes are comparable to the likelihoods from other software (and from versions of MrBayes earlier than 3.2). In this mode, each site's likelihood is calculated as a weighted sum over all rate categories. The weights are the prior probabilities that a site belongs to each category. Because the gamma distribution is discretized by breaking it up into ngammacat chunks of equal probability, the prior probability that any site i is in rate category c is simply 1/ngammacat (this quantity is independent of both the site and the rate category). The likelihood for site i is the sum over all categories c of the quantity

(1/ngammacat)L(i | c)

where L(i | c) is the likelihood of site i conditional upon it being in category c. Typically (at least for datasets with a large number of taxa), one of the rate categories contributes the vast majority of the likelihood to this sum, because its conditional likelihood is much larger than the conditional likelihoods from the other rate categories. Thus the likelihoods under this model are close to 1/ngammacat times the conditional likelihood of the best fitting rate category for each site.
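The weighted sum and the dominance of a single term can be illustrated with a small Python sketch. The conditional likelihood values below are invented for illustration, and site_likelihood is a hypothetical helper, not a MrBayes function.

```python
def site_likelihood(cond_likes):
    """Sum over categories of (1/ngammacat) * L(i | c) for one site."""
    ngammacat = len(cond_likes)
    return sum(like / ngammacat for like in cond_likes)


# Made-up conditional likelihoods L(i | c) for one site, four rate categories.
cond_likes = [1e-12, 3e-9, 2e-6, 5e-10]

total = site_likelihood(cond_likes)
best_term = max(cond_likes) / len(cond_likes)  # (1/ngammacat) * best L(i | c)
share = best_term / total                      # fraction contributed by the best category

print(f"site likelihood      = {total:.6e}")
print(f"best category's term = {best_term:.6e}")
print(f"share of the total   = {share:.4f}")
```

With these numbers the best-fitting category accounts for more than 99% of the site likelihood, so the sum is close to 1/ngammacat times that category's conditional likelihood.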

The fact that the site likelihoods are dominated by one term implies that it is wasteful to calculate the conditional likelihoods over all ngammacat rate categories: most of the calculations will be lost in the rounding error of the final summation over rate categories.

The LSet UseGibbs=yes option addresses this concern. Instead of calculating a site's likelihood by summing over all of the hidden states (the rate categories), an indicator variable that records the rate category for each site is sampled during the MCMC. The "likelihood" reported for each site is actually the conditional likelihood, that is, the likelihood of the site conditional upon it belonging to its currently assigned rate category. Each site has its own value for the hidden state. The uncertainty about each site's "true" rate category is accommodated by sampling with MCMC instead of by summation.

Conveniently, when you calculate the likelihood by summing over all rate categories, you are also in a position to calculate the posterior probability that each site belongs to a particular category. To update the hidden states for each site, you have to:

  1. Calculate the conditional likelihood for each rate category.
  2. Calculate the full site-likelihood by summing over all rate categories (weighted by their prior probabilities).
  3. Calculate the posterior probability of each of the hidden state assignments by dividing the prior-weighted conditional likelihood for each rate category by the site likelihood (so that the probabilities sum to one).
  4. Select a new hidden state for each site by choosing a rate category at random, giving each rate category a probability of being chosen that is equal to the posterior probability that the site belongs to that category.

This is a Gibbs sampling update of the hidden state. The LSet Gibbsfreq option controls how frequently the hidden states are updated. Thus the likelihood for all ngammacat categories only has to be calculated in the iterations in which the hidden state is being updated. In all other iterations, the likelihood is calculated from only the rate category to which that site is currently assigned.
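The four update steps listed above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the MrBayes implementation: the function names and the input values are hypothetical, and the sketch works with raw likelihoods, whereas a real implementation would probably work with scaled or log values to avoid underflow.

```python
import random


def category_posteriors(cond_likes):
    """Steps 1-3: posterior probability of each rate category for one site."""
    prior = 1.0 / len(cond_likes)                      # equal prior, 1/ngammacat
    weighted = [prior * like for like in cond_likes]   # prior * L(i | c)
    site_like = sum(weighted)                          # full site likelihood
    return [w / site_like for w in weighted]           # normalized, sums to one


def gibbs_update(cond_likes_per_site, rng):
    """Step 4: draw a new hidden state (category index) for every site."""
    new_states = []
    for cond_likes in cond_likes_per_site:
        post = category_posteriors(cond_likes)
        state = rng.choices(range(len(post)), weights=post, k=1)[0]
        new_states.append(state)
    return new_states


# Made-up conditional likelihoods for two sites and four rate categories.
sites = [
    [1e-12, 3e-9, 2e-6, 5e-10],
    [4e-5, 1e-7, 2e-9, 1e-11],
]
rng = random.Random(1)  # seeded only to make the sketch repeatable
states = gibbs_update(sites, rng)
print(states)
```

Between such updates, only the conditional likelihood of each site's current category would need to be recomputed, which is where the savings come from.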

Note that in the LSet UseGibbs=no mode, the priors associated with each rate category (the 1/ngammacat terms) are used in calculating the site likelihood. In the LSet UseGibbs=yes mode, these priors are dealt with when the rate categories are sampled. Consequently, when using Gibbs sampling, the reported likelihood is not decreased by weighting it by a prior probability, and the likelihoods that are reported will be higher.
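The size of this difference can be estimated with a short Python sketch. It assumes that every site is currently assigned to its best-fitting category, which is the typical but not guaranteed outcome of the sampling; the function names and likelihood values are hypothetical. Under that assumption the gap per site lies between 0 and log(ngammacat).

```python
import math


def log_like_summed(cond_likes_per_site):
    """Log-likelihood as in UseGibbs=no: prior-weighted sum over categories."""
    return sum(math.log(sum(c) / len(c)) for c in cond_likes_per_site)


def log_like_conditional(cond_likes_per_site, states):
    """Log-likelihood as in UseGibbs=yes: assigned category only, no prior weight."""
    return sum(math.log(c[s]) for c, s in zip(cond_likes_per_site, states))


# Made-up conditional likelihoods for two sites and four rate categories.
sites = [
    [1e-12, 3e-9, 2e-6, 5e-10],
    [4e-5, 1e-7, 2e-9, 1e-11],
]
states = [2, 0]  # assumption: each site sits in its best-fitting category

diff = log_like_conditional(sites, states) - log_like_summed(sites)
bound = len(sites) * math.log(4)  # upper limit: log(ngammacat) per site

print(f"difference in log-likelihood = {diff:.4f}")
print(f"upper limit                  = {bound:.4f}")
```

Because one category dominates each site in this example, the difference falls just below the upper limit, so the two reported log-likelihoods differ by nearly log(ngammacat) per site.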
