Analyzing a Partitioned Data Set 3.2

From MbWiki

Revision as of 13:56, 14 May 2007 by Paulvdm (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Contents

Analyzing a Partitioned Data Set

MrBayes handles a wide variety of data types and models, as well as any mix of these models. In this example we will look at how to set up a simple analysis of a combined data set, consisting of data from four genes and morphology for 30 taxa of gall wasps and outgroups. A similar approach can be used, e.g., to set up a partitioned analysis of molecular data coming from different genes. The data set for this tutorial is found in the file cynmix.nex.

Getting Mixed Data into MrBayes

First, open up the Nexus data file in a text editor. The DATA block of the Nexus file should look familiar but there are some differences compared to the primates.nex file in the format statement:

  Format datatype=mixed(Standard:1-166,DNA:167-3246) interleave=yes gap=- missing=?;

First, the datatype is specified as datatype=mixed(Standard:1-166, DNA:167-3246). This means that the matrix contains standard (morphology) characters in columns 1-166 and DNA characters in the remaining columns. The mixed datatype is an extension to the Nexus ‘standard’. This extension was originated by MrBayes 3 and may not be compatible with other phylogenetics programs.

Second, the matrix is interleaved. It is often convenient to specify mixed data in interleaved format, with each block consisting of a natural subset of the matrix, such as the morphological data or one of the gene regions.

Dividing the Data into Partitions

By default, MrBayes partitions the data according to data type. There are only two data types in the matrix, so this model will include only a morphology (standard) and a DNA partition. To divide the DNA partition into gene regions, it is convenient to first specify character sets. In principle, this can be done from the command line but it is more convenient to do it in a MrBayes block in the data file. With the MrBayes distribution, we added a file cynmix-run.nex with a complete MrBayes block. For this section, we are going to create a command block from scratch, but you can consult the cynmix-run.nex for reference.

In your favorite text editor, create a new file called cynmix-command.nex in the same directory as the cynmix.nex file and add the following new MrBayes block (note that each line must be terminated by a semicolon):

 #NEXUS

 begin mrbayes;
   execute cynmix.nex;
   charset morphology = 1-166;
   charset COI = 167-1244;
   charset EF1a = 1245-1611;
   charset LWRh = 1612-2092;
   charset 28S = 2093-3246;

The first line is required to comply with the nexus standard. With the execute command, we load the data from the cynmix.nex file and the charset command simply associates a name with a set of characters. For instance, the character set COI is defined above to include characters 167 to 1244. The next step is to define a partition of the data according to genes and morphology. This is accomplished with the line (add it after the lines above):

   partition favored = 5: morphology, COI, EF1a, LWRh, 28S;

The elements of the partition command are: (1) the name of the partitioning scheme (favored); (2) an equal sign (=); (3) the number of character divisions in the scheme (5); (4) a colon (:); and (5) a list of the characters in each division, separated by commas. The list of characters can simply be an enumeration of the character numbers (the above line is equivalent to partition favored = 5: 1-166, 167-1244, 1245-1611, 1612-2092, 2093-3246;) but it is often more convenient to use predefined character sets like we did above. The final step is to tell MrBayes that we want to work with this partitioning of the data instead of the default partitioning. We do this using the set command:

  set partition = favored;

Finally, we need to add an end statement to close the MrBayes block. The entire file should now look like this:

  #NEXUS
 
  begin mrbayes;
   execute cynmix.nex;
   charset morphology = 1-166;
   charset COI = 167-1244;
   charset EF1a = 1245-1611;
   charset LWRh = 1612-2092;
   charset 28S = 2093-3246;
   partition favored = 5: morphology, COI, EF1a, LWRh, 28S;
   set partition = favored;
 end;

When we read this block into MrBayes, we will get a partitioned model with the first character division being morphology, the second division being the COI gene, etc. Save the data file, exit your text editor, and finally launch MrBayes and type execute cynmix-command.nex to read in your data and set up the partitioning scheme.

Specifying a Partitioned Model

Before starting to specify the partitioned model, it is useful to examine the default model. Type showmodel and you should get this table as part of the output:

  Active parameters:

                      Partition(s)
     Parameters       1  2  3  4  5
     ------------------------------
     Statefreq        1  2  2  2  2
     Topology         3  3  3  3  3
     Brlens           4  4  4  4  4
     ------------------------------

There is a lot of other useful information in the output of showmodel but this table is the key to the partitioned model. We can see that there are five partitions in the model and four active (free) parameters. There are two stationary state frequency parameters, one for the morphological data (parameter 1) and one for the DNA data (parameter 2). Then there is also a topology parameter (3) and a set of branch length parameters (4). Both the topology and branch lengths are the same for all partitions.

Now, assume we want a separate GTR + <math>\Gamma</math> + I model for each gene partition. All the parameters should be estimated separately for the individual genes. Assume further that we want the overall evolutionary rate to be (potentially) different across partitions, and that we want to assume gamma-shaped rate variation for the morphological data. We can obtain this model by using lset and prset with the applyto mechanism, which allows us to apply the settings to specific partitions. For instance, to apply a GTR + <math>\Gamma</math> + I model to the molecular partitions, we type lset applyto=(2,3,4,5) nst=6 rates=invgamma. This will produce the following table when showmodel is invoked:

  Active parameters:

                      Partition(s)
     Parameters       1  2  3  4  5
     ------------------------------
     Revmat           .  1  1  1  1
     Statefreq        2  3  3  3  3
     Shape            .  4  4  4  4
     Pinvar           .  5  5  5  5
     Topology         6  6  6  6  6
     Brlens           7  7  7  7  7
     ------------------------------

As you can see, all molecular partitions now evolve under the correct model but all parameters (statefreq, revmat, shape, pinvar) are shared across partitions. To unlink them such that each partition has its own set of parameters, type: unlink statefreq=(all) revmat=(all) shape=(all) pinvar=(all). Gamma-shaped rate variation for the morphological data is enforced with lset applyto=(1) rates=gamma. The trickiest part is to allow the overall rate to be different across partitions. This is achieved using the ratepr parameter of the prset command. By default, ratepr is set to fixed, meaning that all partitions have the same overall rate. By changing this to variable, the rates are allowed to vary under a flat Dirichlet prior (see the help info for prset if you want to modify this prior). To allow all our partitions to evolve under different rates, type prset applyto=(all) ratepr=variable.

The model is now essentially complete but there is one final thing to consider. Typically morphological data matrices do not include all types of characters. Specifically, morphological data matrices do not usually include any constant (invariable) characters. Sometimes, autapomorphies are not included either, and the matrix is restricted to parsimony-informative characters. For MrBayes to calculate the probability of the data correctly, we need to inform it of this ascertainment (coding) bias. By default, MrBayes assumes that standard data sets include all variable characters but no constant characters. If necessary, one can change this setting using lset coding. We will leave the coding setting at the default, though. Now, showmodel should produce this table:

  Active parameters:

                      Partition(s)
     Parameters       1  2  3  4  5
     ------------------------------
     Revmat           .  1  2  3  4
     Statefreq        5  6  7  8  9
     Shape           10 11 12 13 14
     Pinvar           . 15 16 17 18
     Ratemultiplier  19 19 19 19 19
     Topology        20 20 20 20 20
     Brlens          21 21 21 21 21
     ------------------------------

Running the Analysis

When the model has been completely specified, we can proceed with the analysis essentially as described above in the previous tutorial for the primates.nex data set. However, in the case of the cynmix.nex dataset, the analysis will have to be run longer before there is any hope of adequate convergence to the stationary distribution.

When looking at the parameter samples from a partitioned analysis, it is useful to know that the names of the parameters are followed by the character division (partition) number in curly braces. For instance, pi(A){3} is the stationary frequency of nucleotide A in character division 3, which is the EF1a division in the above analysis.

In this section we have used a separate nexus file for the MrBayes block. Although one can add this command block to the data file itself, there are a number of advantages of keeping the commands and the data blocks separate. For example, one can create a number of analysis with different parameters in a number of "command" files and submit all those files to a job scheduling system on a computer cluster.

Personal tools