This program was written by Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu December 2003 This program implements the algorithm BUILD-WITH-DISTANCES. It is given distance information in the file Build.input. It attempts to create a tree S which is additive with these distances. Not all a and b need to have l(a,b) given. Hence a natural way for such an input file to arise is from distances in various trees, when the desired tree is a supertree of the input trees. The input file is Build.input The output files are (1) Build.output This contains summary of all the results. (2) Build.outcode This contains summary information about the trees in a format that can be used for input. (3) Build.nex This nexus file can be used with PAUP* to draw the trees which have been generated. Format of Build.input: There are three types of input file allowed: (A) Input type 1: with the distances explicit: line 1: number_of_taxa 1 line 2: identification string terminated by the character * The identification string may occupy more than one line, but it may not contain an embedded character * successive lines: the codenumbers and names of the taxa (at most labellength characters terminated by *) in order with the code number (an integer at most maxleaves), then the name (blanks are allowed), then * The number of these lines should match number_of_taxa on line 1. successive lines: the distances in the form a b d(a,lub(a,b)) for example 1 2 6.57 to say the distance from taxon # 1 to lca(#1,taxon # 2) is 6.57 concluding line:0 0 0 This is a dummy line to indicate the end of the input. The same a and b may appear at most maxnumbersubtrees times. It need not have the same value in each case, but the values will be averaged. Lines with a = b are ignored. Blank lines may occur in the data Note here that d(a,lub(a,b)) need not equal d(b,lub(b,a)). These are the distances from a to the most recent common ancestor of a and b; or from b to the most recent common ancestor of a and b. Only in ultrametric situations are these necessarily the same. Example of input of type 1: 6 1 No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.* 1 human * 2 chimpanzee * 7 horse * 4 whale * 5 bat * 9 cow * 1 2 1.0000000 1 7 3.0000000 1 4 3.0000000 2 1 1.0000000 2 7 3.0000000 2 4 3.0000000 7 1 2.0000000 7 2 2.0000000 7 4 1.0000000 4 1 3.0000000 4 2 3.0000000 4 7 2.0000000 1 4 3.0000000 1 5 2.0000000 1 9 3.0000000 4 1 3.0000000 4 5 3.0000000 4 9 1.0000000 5 1 1.0000000 5 4 2.0000000 5 9 2.0000000 9 1 3.0000000 9 4 1.0000000 9 5 3.0000000 2 5 1.5000000 2 9 1.0000000 5 2 2.0000000 5 9 2.0000000 9 2 3.0000000 9 5 3.5000000 0 0 0.0 (B) Input type 2: with the distances implicit: line 1: number_of_taxa 2 line 2: identification string ending in * successive lines: the codenumber and the names of the taxa (at most labellength characters terminated by *) first subtree 5 17 63 81 82 83 0 the taxa in the first rooted tree, listed by code numbers, ended by 0 5 00.213the first rooted cluster ended by 0; followed by its branchlength to its parent 5 63 0 0.316 the second rooted cluster ended by 0, followed by its branchlength to its parent ... 0 0.0 a dummy ending the input of clusters for the first tree 17 81 83 85 89 0 the taxa in the second rooted tree, ended by 0 17 0 0.333the first rooted cluster ended by 0, followed by its branchlength to parent ... 0 0.0 a dummy ending the input of clusters for the second tree ... other rooted trees 0 0.0 dummy ending of the last rooted tree 0 dummy to end the input of trees Note that singleton clusters must be included in the input of each tree. Note that the dummies should include a numerical branchlength. Blank lines can make the divisions easier. 6 2 No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.* 1 human * 2 chimpanzee * 7 horse * 4 whale * 5 bat * 9 cow * 1 2 7 4 0 1 0 1 2 0 1 7 0 1 4 0 2 1 2 0 2 7 4 0 1 0 0 1 4 5 9 0 1 0 2 4 0 1 5 0 1 9 0 1 1 5 0 1 4 9 0 2 0 0 2 5 9 0 2 0 1 5 0 2 9 0 3 2 9 0 0.5 0 0 0 In the example above, the first input tree is described as 1 2 7 4 0 1 0 1 2 0 1 7 0 1 4 0 2 1 2 0 2 7 4 0 1 0 0 This means that the leaves of the first tree were 1, 2, 7, and 4 so the root of the first input tree was the cluster (1,2, 7, 4). The cluster (1) had branchlength 1 to its parent. The cluster (2) had branchlength 1 to its parent. The cluster (7) had branchlength 1 to its parent. The cluster (4) had branchlength 2 to its parent. The cluster (1,2) had branchlength 2 to its parent. The cluster (7,4) had branchlength 1 to its parent. There were no other clusters. For example, l(2,4) = 1+2 = 3 and l(7,4) = 1. This codifies the tree (1274) / \ (12) (74) / \ / \ (1) (2) (4) (7) in which branch lengths are indicated as follows in []: (1274) / [2] \[1] (12) (74) [1]/ \[1] [2]/ \[1] (1) (2) (4) (7) (C) Input type 3 via a nexus-like description using parentheses: line 1: number_of_taxa 3 line 2: identification string ending in * successive lines: the codenumber and the names of the taxa (at most labellength characters terminated by *) number_of_subtrees being input first subtree in parenthesis notation second subtree in parenthesis notation ... last subtree in parenthesis notation The parenthesis notation for the first tree above is: (2:1, ((5:1,9:1):2,4:2):1) Numbers after colons are the branchlength to the parent. Leaves are numbers like 9. Higher clusters are in parentheses. Thus this says that the nonsingleton clusters are (2,5,9,4), (5,9), (5,9,4). The length of the edge from 5 to its parent is 1. The length of the edge from (5,9) to its parent is 2. Example: 6 3 No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.* 1 human * 2 chimpanzee * 7 horse * 4 whale * 5 bat * 9 cow * 3 ((1:1,2:1):2,(7:1,4:2):1) ((1:2,5:1):1,(4:1,9:1):2) (5:2,(2:1,9:3):0.5) SAMPLE OUTPUT: Any of these input files will result in the following output: BEGINNING OF THE OUTPUT FILE Build.output: This program implements BUILD-WITH-DISTANCES. It computes rooted supertrees by distance methods that do not require all distances to be known. Distances are given by an lca-distance function l(a,b) where l(a,b) is the length of the path from a to the lowest common ancestor of a and b. Equivalently, l(a,b) = d(a, lca(a,b)). There are several different definitions of support in the various graphs. 1. Primary. There is an edge (ab) if either there is c such that l(a,c)-l(a,b) > threshold OR l(b,c)-l(b,a) > threshold. 2. Accumulated primary. There is an edge (ab) if the sum of the positive l(a,c)-l(a,b) and l(b,c)-l(b,a) exceeds the threshold. 3. Confirmed. There is an edge (ab) if there is c such that l(a,c)-l(a,b) > threshold AND l(b,c)-l(b,a) > threshold. 4. Accumulated confirmed. There is an edge (ab) if the sum of the positive minima of (l(a,c)-l(a,b) and l(b,c)-l(b,a)) exceeds the threshold. The input number of taxa is 6 Input via nexus-type tree descriptions. The input type is 3 The identification was: No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star. Taxa: 1 human 2 chimpanzee 7 horse 4 whale 5 bat 9 cow ---------------- The null threshold tree S(0) with primary support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root The graph does not contain all singletons and is not a true tree. The null threshold tree S(0) with primary support: RMS L2 deviation: 2.30228 -------------- The minimal threshold tree Sap with accumulated primary support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root # 2: 1 2 5 0 1 # 3: 7 4 9 0 1 # 4: 1 2 0 0.5 # 5: 5 0 1 # 6: 7 0 1 # 7: 4 9 0 1 # 8: 1 0 1 # 9: 2 0 1 # 10: 4 0 1 # 11: 9 0 1 Here is a parenthetic form of the graph: (((1,2),5),(7,(4,9))) Here is a parenthetic form with edge lengths: (((1:1,2:1):0.5,5:1):1,(7:1,(4:1,9:1):1):1) The minimal threshold tree Sap with accumulated primary support: RMS L2 deviation: 0.430228 Maximal threshold used was 4.75 -------------- The minimal threshold tree Sac with accumulated confirmed support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root # 2: 1 2 5 0 1 # 3: 7 4 9 0 1 # 4: 1 2 0 0.5 # 5: 5 0 1 # 6: 7 0 1 # 7: 4 0 1 # 8: 9 0 1 # 9: 1 0 1 # 10: 2 0 1 Here is a parenthetic form of the graph: (((1,2),5),(7,4,9)) Here is a parenthetic form with edge lengths: (((1:1,2:1):0.5,5:1):1,(7:1,4:1,9:1):1) The minimal threshold tree Sac with accumulated confirmed support: RMS L2 deviation: 0.688155 Maximal threshold used was 0.25 -------------- The minimal threshold tree Sc with confirmed support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root # 2: 1 2 5 0 1 # 3: 7 4 9 0 1 # 4: 1 2 0 0.5 # 5: 5 0 1 # 6: 7 0 1 # 7: 4 0 1 # 8: 9 0 1 # 9: 1 0 1 # 10: 2 0 1 Here is a parenthetic form of the graph: (((1,2),5),(7,4,9)) Here is a parenthetic form with edge lengths: (((1:1,2:1):0.5,5:1):1,(7:1,4:1,9:1):1) The minimal threshold tree Sc with confirmed support: RMS L2 deviation: 0.688155 Maximal threshold used was 0.25 -------------- The minimal threshold tree Sp with primary support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root # 2: 1 2 4 5 9 0 1 # 3: 7 0 1 # 4: 1 2 4 9 0 0.25 # 5: 5 0 1 # 6: 1 0 1 # 7: 2 0 1 # 8: 4 0 1 # 9: 9 0 1 Here is a parenthetic form of the graph: (((1,2,4,9),5),7) Here is a parenthetic form with edge lengths: (((1:1,2:1,4:1,9:1):0.25,5:1):1,7:1) The minimal threshold tree Sp with primary support: RMS L2 deviation: 1.26909 Maximal threshold used was 2 -------------- Summary of the L2 deviations. The tree with the smallest L2 deviation has the best fit. Sap: 0.430228 Sac: 0.688155 Sc: 0.688155 Sp: 1.26909 The best tree is Sap. END OF THE OUTPUT FILE. INTERPRETING THIS OUTPUT: (1) In the minimal threshold tree Sap with accumulated primary support: The root is (127459) There is a cluster (125) and the length of the edge to its parent is 1. There is a cluster (749) and the length of the edge to its parent is 1. There is a cluster (12) and the length of the edge to its parent is 0.5. There is a cluster (5) and the length of the edge to its parent is 1. There is a cluster (7) and the length of the edge to its parent is 1. There is a cluster (49) and the length of the edge to its parent is 1. There is a cluster (1) and the length of the edge to its parent is 1. There is a cluster (5) and the length of the edge to its parent is 1. There is a cluster (4) and the length of the edge to its parent is 1. There is a cluster (9) and the length of the edge to its parent is 1. The minimal threshold tree Sap can be written using parenthesis notation as (((1,2),5),(7,(4,9))) When distances are included, the same tree is written as (((1:1,2:1):0.5,5:1):1,(7:1,(4:1,9:1):1):1) The input implied certain lca distances l(a,b) for (a,b) in a domain D. Once the tree Sap definition was computed, new lca distances l'(a,b) could be inferred. The RMS L2 error estimate is sqrt ( sum (l(a,b)-l'(a,b))^2) where the sum is over all (a,b) in D. For this tree, the RMS L2 deviation is 0.430228 In the computation of Sap, the largest threshold that was utilized was 4.75. The other trees can be interpreted analogously. (2) At the end of the output file is a summary of the L2 deviations for all four trees. In this case Sap had the smallest L2 deviation and hence appears superior to the other trees. OBTAINING PICTURES OF THE GRAPHS: The output file Build.nex is a nexus file that can be input to PAUP*. Then PAUP* commands such as SHOW TREES will give a variety of images of the trees. For the input files given above, the file Build.nex should appear as follows: BEGINNING OF NEXUS FILE #NEXUS [Additive rooted supertrees by BUILD-WITH-DISTANCES. No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star. Sap = minimal threshold tree with accumulated primary support Sac = minimal threshold tree with accumulated confirmed support Sc = minimal threshold tree with confirmed support Sp = minimal threshold tree with primary support ] BEGIN TAXA; DIMENSIONS NTAX = 6; TAXLABELS 1_human 2_chimpanzee 7_horse 4_whale 5_bat 9_cow ; END; BEGIN TREES; TREE Sap = (((1,2),5),(3,(4,6))); TREE Sap.dis = (((1:1,2:1):0.5,5:1):1,(3:1,(4:1,6:1):1):1); TREE Sac = (((1,2),5),(3,4,6)); TREE Sac.dis = (((1:1,2:1):0.5,5:1):1,(3:1,4:1,6:1):1); TREE Sc = (((1,2),5),(3,4,6)); TREE Sc.dis = (((1:1,2:1):0.5,5:1):1,(3:1,4:1,6:1):1); TREE Sp = (((1,2,4,6),5),3); TREE Sp.dis = (((1:1,2:1,4:1,6:1):0.25,5:1):1,3:1); END; END OF NEXUS FILE Note that the third taxon is identified as 3 rather than 7 because PAUP* does not accept code number identifications of the taxa. CURRENT CONSTRAINTS: The maximum number of taxa is maxleaves = 250; The maximum number of input trees is maxtrees = 100. The maximum length of an identification string is maxidlength = 1500. The maximum length of a taxon label is labellength = 100. These may be changed in the constant declarations. The identification string should not contain the symbol ] and must terminate with the symbol *