Instructions for use of the programs CreateT4k.c ErrorRectifyT4k.cp Thetaktree.c Replicate.c Sample data sets are also provided: CreateT4.data.type2 CreateT4.data.type3 Executable files for the Power Macintosh are also provided: These do not need to be compiled. CreateT4k.ex ErrorRectify.ex Thetaktree.ex Replicate.ex OVERVIEW: ErrorRectifyT4k.cp is written in C++. All other programs are written in C. They have worked on a Power Macintosh 7300/200. They were compiled using Codewarrior 9 or higher, available from Metrowerks. See http://www.metrowerks.com Program CreateT4k.c utilizes a data set consisting of the strings for various species. Its output is a file T4klist.CreateT4k which tells numerical indicators for all trees with exactly four species from the data set. The numerical indicators are equivalent to indications of the treelengths of each tree with exactly four species. The user may specify whether the criterion to be utilized is Maximum Parsimony or instead Higher Order Parsimony. Higher Order Parsimony makes a correction to Maximum Parsimony so as to reduce the effect of long-branch attraction. The program also produces a file CreateT4k.output containing information about the data set. For example, this file tells the beginnings of the data strings so that the user can verify that the data was read correctly. Program ErrorRectifyT4k takes a file T4klist.Thetaktree giving information about each quartet, and outputs a file T4klist.ErrorRectify which gives modified information about each quartet. The modification process rectifies errors in the original dataset, usually producing a stronger signal for the correct tree. The idea is that certain quartets imply phylogenetic information about other quartets. Thus, the program ErrorRectifyT4k modifies information about a quartet when the other quartets give a strong signal that the original quartet was incorrectly calculated. Program Thetaktree.c utilizes a file T4klist.Thetaktree in order to build automatically a good candidate for a phylogenetic tree. Such a file is obtained, for example, by renaming the output file T4klist.CreateT4k to become T4klist.Thetaktree. Alternatively, the output file T4klist.ErrorRectify of the program ErrorRectifyT4k may be renamed T4klist.Thetaktree. The tree is produced iteratively so as to have good measures of inconsistency with the trees on four species. Basically, it accepts as correct the trees with four species for which there is a large gap between the treelength of the maximum parsimony tree and the treelength of the runner-up topology. It utilizes these so as to try to minimize at each stage the local inconsistency measure I(T). Good values are negative and very large in absolute value (far from 0). Bad values are positive and large. When run in the automatic mode, species are selected in the order which maximizes the signal strength for adding species. Program Replicate.c also utilizes the file T4klist.Thetaktree such as was made by CreateT4k.c or ErrorRectifyT4k in order to build a good candidate for a phylogenetic tree. Rather than utilize signal strength to determine the order in which species are added, it performs many replicates to produce many trees. Then it summarizes the results and indicates a majority-rule consensus of the trees so obtained. For each replicate, it makes a random choice of the order in which to add species. Species are then added in that order utilizing local inconsistency measures to identify the best possible placement. To use the programs, (1) modify your data set so as to match the form of either file CreateT4.data.sample (2) rename your data file CreateT4.data (3) compile and run CreateT4k.c (4) look at the output file T4klist.CreateT4k and rename it T4klist.Thetaktree (5) If you wish to utilize the error rectification algorithm to strengthen the signal for the correct tree (if your list of quartets is fairly accurate), then run ErrorRectifyT4k. The output will be a new file T4klist.ErrorCorrect. Rename this file T4klist.Thetaktree. (If you do not wish to utilize error correction, then you should omit step 5.) (6) If you wish to utilize Thetaktree in order to add species either in the order suggested by signal strength or interactively, compile and run Thetaktree.c. Interpret your output in the file Thetaktree.output (7) If, instead, you wish to use program Replicate to produce many trees by adding species in a random order and summarizing the results, compile and run Replicate.c. Interpret your output in the file Replicate.output All programs except ErrorRectifyT4k are currently set to have a maximum of 40 species as indicated by an early line #define maxleaves 40. ErrorRectifyT4k is set to have a maximum of 30 species because of a larger demand for disk space. CreateT4k has a maximum of 13000 characters to a string, as indicated by an early line #define maxlength 13000 The user is warned, however, that the operating time and space requirements for CreateT4k are approximately O(L n^4) if there are n species and the strings have length L. The time requirement for Thetaktree is approximately O(n^5). For n large (for example, 30) the interactive option of Thetaktree will be substantially faster than the automatic option. The time requirement for Replicate is approximately O(r*n^4) where r is the number of replicates. It is probably faster than Thetaktree in the automatic mode for moderate values of r. For example, it may take about 3 minutes to do r = 100 replicates for n = 16 taxa on a PowerPC 7300/200. The ideas behind the programs may be found in two references: Willson, S.J. 1998. Measuring inconsistency in phylogenetic tree. Journal of Theoretical Biology 190: 15-36. Willson, S.J. 1999. Building phylogenetic trees from quartets by using local inconsistency measures, Molecular Biology and Evolution 16: 685-693. Willson, S.J. 1999. A higher-order parsimony method to reduce long-branch attraction, Molecular Biology and Evolution 16 : 694-705. Willson, S.J. 2000. An error-rectifying map for quartets can improve the signals for phylogenetic trees. Preprint. PROGRAM Thetaktree.c Thetaktree.c will first ask the user for the outgroup, to be entered via an integer . It will then come to a basic menu: Enter 1 to quit, doing slow summaries for the remaining species. Enter 2 to quit quickly, without slow computations for remaining species. Enter 3 to give the results of placements for all species and use one placement with strongest signal. Enter 4 to use strongest signal placements automatically until all possible species are placed. Enter 5 to choose a placement interactively. The user should enter an integer for the desired option. Both 1 and 2 will terminate the program. If the number of species is small, both options are comparably fast. If the number of species is large, the use of option 1 will take substantial time. Both choices result in the output of the tree as found so far, as well as various summaries. The program will display the message "Program has ended." The final termination of the program on some systems requires that the user go to a standard operating system menu to quit. This should be done only after having used option q or 8 since otherwise the tree found will not be in the output file. On other systems this last step will not be necessary. Option 4 corresponds to a completely automatic selection of the tree. The program successively finds the placements with the strongest signal strengths. Option 3 performs one step of option 4: it places only one species into the tree and then returns to the main menu. Option 4 is equivalent to repeatedly performing option 3. Option 5 permits the user to build the tree interactively. The computer will prompt the user for the next species to attempt to insert into the tree. It will then summarize the results of various placements. The user should recognize that small numbers are good, and negative numbers are best. Hence we recommend a placement with I = -3 as better than one with I = 5. In general, we recommend a placement with fewer inconsistent quartets over a placement with more inconsistent quartets. These two criteria do not necessarily agree in all cases, but when the evidence is strong they tend to agree. For example, after giving option 5 and indicating a wish to place species 1, the user might see the following: "We try to insert species 1 Possible placements for species 1 Placement 1 has Iabs= 26.000, badtreecount = 2 Placement 2 has Iabs= 32.000, badtreecount = 2 Placement 3 has Iabs= 34.000, badtreecount = 2 Placement 4 has Iabs= 34.000, badtreecount = 2 Placement 5 has Iabs= -25.000, badtreecount = 0 Placement 6 has Iabs= -25.000, badtreecount = 0 ** The placements 5 6 are equivalent and redundant. ** Lowest absolute inconsistency is at placements 5 6 with I= -25.000 and group signal strength 51.000 ** Lowest number of bad quartets is at placements 5 6 with count= 0 and group signal strength 2 Enter the placement number to use for placing species 1 or enter 0 to choose none." At this stage, the user will probably type 5 to select placement 5 because it is best by both criteria--it has the lowest inconsistency (Iabs) and the fewest inconsistent quartets (badtreecount). In this case the computer indicates that placements 5 and 6 are equivalent. Hence it is immaterial whether the user indicates 5 or 6 since the same tree would result. The use of placement 3, however, would result in a different tree with inconsistency 34 that is inconsistent with 2 quartets. If the user instead preferred not to make a choice of placement, then the user should enter 0 rather than any of the indicated placements. OUTPUT FILE Thetaktree.output The output file Thetaktree.output will first summarize the session yielding the tree. It will give the tree by identifying the nodes of the tree via a numerical label and by telling to what other nodes a given node is connected. It will then give the tree by a parenthesis notation showing the grouping of the taxa. It will then list all the quartets that are inconsistent with the final tree. Finally it also gives summary data such as the inconsistency of the resulting tree and the number of inconsistent quartets. Following is more detail about some of these summaries: (1) The description of the tree by identification of nodes contains lines like the following: "Node 4 It is a leaf for Flea This node reaches up to node 6 Node 5 It is a leaf for Scorpionfly This node reaches up to node 6 Node 6 This node reaches down to left and right nodes 4 and 5 This node reaches up to node 9" This indicates that in the final tree node 4 is a leaf corresponding to the species Flea (also identified as species number 4). Node number 6 is connected by edges to nodes numbers 4, 5 and 9. Such descriptions can easily be used to create a copy of the tree by hand. (2) The description of the final tree in parenthesis form might look like this: (1 3 (2 (4 5))) This tree indicates that species 4 and 5 are together. Their attaching node is then connected to species 2. This attaching node is attached to the root; the root is also attached to species 1 and species 3. (3) The list of quartets which are inconsistent with the final tree might look like the following. Tree ((1 4) (2 3)) has gap 32.000 It has site measures 42.000 16.000 10.000 while the tree matches 1 with 3. Tree ((1 5) (2 3)) has gap 30.000 It has site measures 40.000 15.000 10.000 while the tree matches 1 with 3. In this example, for the quartet {1,2,3,4}, the final tree matches species 1 with 3 and therefore would induce the tree ((1 3) (2 4)). Nevertheless, the data set indicates that the tree ((1 4) (2 3)) is more likely to be correct since there is a signal of strength 42 for it, which is 32 (the "gap") more than the signal for ((1 3) (2 4)). (4) The final summary might look like the following: The global inconsistency measure is 32.000. The number of inconsistent quartets is 2. This indicates that for the final tree the inconsistency is I = 32. Exactly 2 quartets are inconsistent with this tree. PROGRAM Replicate.c The program will list the taxa in order and then prompt the user to enter the outgroup. The user should enter the integer corresponding to the outgroup desired. Only a single taxon may be indicated. The program will indicate its starting tree (the tree for the quartet with the greatest strength). It will then ask the user to enter the desired number of replicates. The user should enter this number as an integer. The author's experience is that 100 is a good number of replicates for many data sets. Bear in mind that the running time will be proportional to the number of replicates indicated. The program will then indicate its progress. At the end, it will print out Program has ended. In some installations, the program will then quit immediately; in others, the user must utilize other means to quit the program. For example, on the Power Macintosh, one should then quit the program utilizing the Quit option from the File menu at the top of the screen. The output will be in the file Replicate.output OUTPUT FILE Replicate.output The file will list the labels for the species and will indicate the order of adding species for each replicate. There will then be a "Summary of the computer run:" For example one might see " Summary of the computer run: The run had 100 replications. We obtained 19 distinct trees. We started with the tree ((1 2) (8 9)) with inconsistency -846.000. To place species we would use the placement with lowest local absolute inconsistency with ties broken by having the fewest inconsistent quartets. Here is a summary of the trees found. Tree Frequency Iabs b average dist dist to consensus tree 1 21 25.000 148 2.7400 0 2 12 30.000 178 3.9200 4 3 24 23.000 172 3.1800 2 4 1 32.000 187 5.0000 4 5 5 41.000 223 7.7000 8 6 8 30.000 154 3.4800 2 7 5 35.000 170 4.8400 4 8 5 34.000 235 4.2600 2 9 1 41.000 247 8.1400 10 10 2 41.000 224 6.8400 6 11 5 25.000 168 3.7800 2 12 3 32.000 181 4.2600 2 13 1 32.000 205 4.7000 4 14 1 32.000 201 5.3000 4 15 1 41.000 234 8.1800 10 16 1 36.000 204 5.4800 4 17 1 36.000 203 6.3400 6 18 1 41.000 211 6.8800 6 19 2 41.000 210 7.7400 8 " This table indicates that out of 100 replicates, only 19 different trees were actually obtained. Tree number 1 resulted from 21 replicates, while tree number 2 resulted from 12 replicates. The absolute inconsistency Iabs of tree number 3 was 23; Iabs is a measure of inconsistency, and small numbers are good. Thus tree number 3 had the best measure of inconsistency among all the trees obtained. The number b indicates the number of quartets for which the preferred tree differed from that induced from the indicated tree. Thus there were 172 quartets for which tree number 3 induced a different tree from the preferred tree for that quartet. The average dist is the average of the Margush-McMorris metric distances from the indicated tree to all the replicate trees obtained. The dist to consensus tree is the Margush-McMorris metric distance from the indicated tree to the majority-rule consensus tree. The Margush-McMorris metric distance between T and U tells the sum of the number of bipartitions in T but not in U, plus the number of bipartitions in U but not in T. For more information, see T. Margush and F.R. McMorris. 1981. Consensus n-trees. Bulletin of Mathematical Biology 43: 239-244. Thus Tree number 3 had average distance 3.18 to the various replicates, and distance 2 from the consensus tree. Tree number 1 had distance 0 to the consensus tree and as a consequence in fact agrees with the consensus tree. Next, each tree is individually described: For example, one may see: "****Tree number 1 occurred with frequency 21 It has absolute inconsistency 25.000 and bad quartet count 148. It is first obtained as replica number 1 Its average distance to the replicates is 2.7400 Its distance to the consensus tree is 0 Following are the nontrivial partition sets containing the outgroup 15 1 *******--- --**** 2 -----**--- ---*** 3 -----***** ****** 4 ---------- ----** 5 *****--*** ***-** 6 *******--* -***** 7 *********- *-**** 8 ----****** ****** 9 *****--*** ****** 10 *******--* ****** 11 ---******* ****** 12 *******--- ---*** 13 --******** ****** Here is a description of the final tree in parenthesis form. (15 16 ((14 (6 7)) ((5 (4 (3 (1 2)))) (13 ((11 (8 9)) (10 12)))))) *************************" This describes tree number 1 in detail, giving both a parenthesis form and a list of the bipartitions. There follows a summary of all the bipartitions obtained in any tree together with their frequencies. Finally, the majority-rule consensus tree is written out. First, the bipartitions containing the outgroup are listed. Then, a description of the tree in parenthesis notation is given. For example, we might see: "Following are the nontrivial partition lists for the consensus tree frequency setsize complementsize 1 *******--- --**** 100 11 5 2 -----**--- ---*** 81 5 11 3 -----***** ****** 100 11 5 4 ---------- ----** 81 2 14 5 *****--*** ***-** 58 13 3 6 *******--* -***** 61 13 3 7 *********- *-**** 100 14 2 8 ----****** ****** 100 12 4 9 *****--*** ****** 100 14 2 10 *******--* ****** 100 14 2 11 ---******* ****** 100 13 3 12 *******--- ---*** 82 10 6 13 --******** ****** 100 14 2 Here is the consensus tree, written utilizing the outgroup. (15 16 ((14 (6 7 ))((5 (4 (3 (1 2 ))))(13 ((10 12 )(11 (8 9 )))))))" The second bipartition indicates that species {6,7,14,15,16} formed one bipartition, so that some edge in the consensus tree broke up the taxa into this set and its complement. This set has 5 elements, and its complement has 11 elements. This bipartition occurred in 81 of the 100 replicates. PROGRAM CreateT4k.c Program Create T4k.c utilizes a data set consisting of the strings for various species. Its output is a file T4klist.CreateT4k which gives measures of strengths for all binary trees with exactly four species from the data set. Use of this program is equivalent to the use of a method to select which of three possible topologies for each quartet is to be regarded as correct. The data are assumed aligned of a common length. Upper or lower case letters may be used. A gap may be indicated by either - or . Blanks are ignored. Any other symbol will cause an error. The input will be an input data file CreateT4.data containing the strings. The output will be another file T4klist.CreateT4k containing the list of quartets with various measures of strength. There is no count or limit on the number of quartets, nor is any count given in the output file. The program will initially print out a list of the species in question and a list of the first few characters in each data string. It will then ask the user to confirm some options: (1) It will print out: "This program is currently set to ignore sites for a quartet containing a gap. Enter 2 to reverse this, 1 to continue." If the user hits 1 then any single site for a given quartet containing for example, -,A,-,A in that same site for the four species will be ignored as giving no information. If the user hits 2 then that same site would be treated as giving some information just as though - were a separate symbol; in this example, it would be equivalent for this quartet to having the characters C,A,C,A in that site. (2) If the data consist of amino acid residues, the program will print out: "The data consisted of amino acids. Enter 1 to analyze the data via Maximum Parsimony for each quartet. Enter 2 to analyze the data via Grouped Higher Order Parsimony for each quartet. Grouped Higher Order Parsimony will reduce deceptive long-branch attractions." If the user hits 1 then the topology for each quartet is evaluated according to maximum parsimony. This method selects a tree with as few step changes as possible for the data set. If the user hits 2 then the topology for each quartet is evaluated according to "Grouped Higher Order Parsimony." In this method an estimate is made for how many steps in maximum parsimony arise because of long-branch attraction; this estimate is subtracted before the decision is made about the correct topology. If the data consist of nucleotides, the program will print out: "The data consisted of nucleotides. Enter 1 to analyze the data via Maximum Parsimony for each quartet. Enter 2 to analyze the data via Higher Order Parsimony for each quartet. Higher Order Parsimony will reduce deceptive long-branch attractions." If the user hits 1 then the topology for each quartet is evaluated according to maximum parsimony. This method selects a tree with as few step changes as possible for the data set. If the user hits 2 then the topology for each quartet is evaluated according to "Higher Order Parsimony." In this method an estimate is made for how many steps in maximum parsimony arise because of long-branch attraction; this estimate is subtracted before the decision is made about the correct topology. (3) The program will print out: "Do you want to create the T4klist? Enter 1 to do so, 2 to quit." The user should usually enter 1 to produce the T4klist. If the user believes that there has been an error, the user can save time by instead entering 2. (4) The computer will print out "Program has ended successfully." In some systems the user must now utilize a utility window to quit the program. In other systems, the program will end immediately. The T4klist will be in the file T4klist.CreateT4k Another file CreateT4k.output will summarize the session. FORM OF THE DATA SET CreateT4.data The data set CreateT4.data contains the strings for all taxa. The strings must already be aligned. No realignment is performed by the program CreateT4k. The first line of the data set has four numbers separated by spaces (a) the number of species; (b) the maximal length of the strings; (c) a flag which is 1 or 0; (d) an integer flag for datatype. The second line is a string with identification information. After these two lines come the data. For example, the first two lines might be: 16 3857 1 2 number species length of string, 1 to ignore gaps 0 to include Mitochondrial DNA. If the first flag equals 0 then gaps are equivalent to a new symbol, of distance 1 from anything beside a gap. If the first flag is 1, then a gap matches any other symbol. The second flag tells the form of the remaining data. The content of the remainder of the file depends on this datatype flag: datatype = 2: If there are interwoven pages, datatype = 2, then initially the lines consist of the species label (length 13) then the first 50 letters, for all the species; then a gap and the next 50 letters for all the species, etc. The program reads the first actuallength characters. It is assumed that the strings are already aligned. A sample data set of this format is provided in CreateT4.data.type2 datatype = 3: If datatype is 3, then there are as many additional lines as there are species. Each line starts with the species label, terminated by *. There follows the complete string, terminated by a line break. The program reads only the first actuallength nonblank characters or to the end of the line, whichever comes first. It is assumed that the strings are already aligned. A sample data set of this format is provided in CreateT4.data.type3 OUTPUT FILE T4klist.CreateT4k The output file T4klist.CreateT4k has the following form: (1) The first line consists of two numbers: (a) the number of taxa (b) either 0 or 1 ; 0 indicates that a gap was treated as a symbol while 1 indicates that for each quartet a site containing a gap was ignored (2) The second line is a string, usually indicating the source of the data (3) If there are N taxa, the next N lines contain the code number for each taxon together with a label for each taxon (4) The next lines contain information about each quartet. Each line has the form a b c d kabcd kacbd kadbc This line gives information about the three possible trees for the quartet {a,b,c,d}. kabcd gives a numerical measure of the strength of the evidence that the correct tree is ((ab)(cd)). kacbd gives a numerical measure of the evidence that the correct tree is ((ac)(bd)). kadbc gives a numerical measure of the evidence that the correct tree is ((ad)(bc)). The larger the numerical measure, the stronger the evidence. The species a,b,c,d are ordered in such a manner that kabcd >= kacbd and kabcd >= kadbc. (5) The last line is a code line 0 0 0 0 0 0 0 This indicates the end of the file. For example, here is the beginning of a file T4klist.CreateT4k 16 1 Data set from D'Erchia et al 1996 aligned by them. Datatype 2. Already aligned strings of length 3857 with 1953 informative positions. 1 pygmychimp 2 commonchim 3 human 4 gorilla 5 orangutan 6 mouse 7 rat 8 greyseal 9 harborseal 10 cow 11 horse 12 whale 13 rabbit 14 guineapig 15 opossum 16 hedgehog 1 2 3 4 45 6 8 1 2 3 5 35 6 12 1 2 3 6 30 7 10 ... 0 0 0 0 0 0 0 The file generated by CreateT4k is equivalent to the use of maximum parsimony to determine the correct tree for each quartet. For the line a b c d kabcd kacbd kadbc the number kabcd tells the number of sites in which taxa a and b have the same symbol, taxa c and d have the same symbol, and these two symbols are different. For example, a site in which species a has T, b has T, c has A, d has A contributes 1 to kabcd. Similarly the number kacbd tells the number of sites in which taxa a and c have the same symbol, taxa b and d have the same symbol, and these two symbols are different. The number kadbc tells the number of sites in which taxa a and d have the same symbol, taxa b and c have the same symbol, and these two symbols are different. SUGGESTIONS FOR USERS WHO DO NOT WISH TO UTILIZE MAXIMUM PARSIMONY The programs Thetaktree and Replicate will work utilizing any file of the form indicated above for T4klist.CreateT4k. The user may utilize any method he or she wants to prepare the file T4klist.Thetaktree, and then the use can run Thetaktree or Replicate to build the tree. For example, the user might prefer a particular model for evolution and wish to select the tree for each quartet according to the criterion of maximal likelihoods. In this case, the file should have the same form as above, but the lines a b c d kabcd kacbd kadbc would be produced in a different manner. For example, kabcd should be log(L((ab)(cd)) where L((ab)(cd)) denotes the likelihood of the tree ((ab) (cd)). A typical line might look like 2 4 3 5 -3.066383e+03 -3.067645e+03 -3.067243e+03 Here the logarithm of the likelihood of ((2 5) (4 3)) is -3.067243e+03. Note that the taxa are listed in the order 2,4,3,5 to indicate that ((2 4) (3 5)) has the strongest evidence for being correct. In this case, it is because the likelihood of ((2 4) (3 5)) is higher than the likelihood of the alternatives. PROGRAM ErrorRectifyT4k ErrorRectifyT4k has no interactions with the user. Its only input is a file named T4klist.Thetaktree such as above. The only output that is essential is a file T4klist.ErrorRectify which contains the revised list of quartets. Any quartets which have been changed are identified with the phrase REVERSED at the end of the line. In order to utilize this revised list, it should be renamed T4klist.Thetaktree before the user runs Replicate or Thetaktree. ErrorRectifyT4k also produces an output file ErrorRectify.output. This file summarizes the rectification steps that took place. For example, it tells the number of rectifying passes that took place (maximum of 50). It tells the number of quartets that were modified by the process.