#include #include #include #include #define boolean int /*This program was written by Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu December 2003 This program implements the algorithm BUILD-WITH-DISTANCES. It is given distance information in the file Build.input. It attempts to create a tree S which is additive with these distances. Not all a and b need to have l(a,b) given. Hence a natural way for such an input file to arise is from distances in various trees, when the desired tree is a supertree of the input trees. The input file is Build.input The output files are (1) Build.output This contains summary of all the results. (2) Build.outcode This contains summary information about the trees in a format that can be used for input. (3) Build.nex This nexus file can be used with PAUP* to draw the trees which have been generated. Format of Build.input: There are three types of input file allowed: (A) Input type 1: with the distances explicit: line 1: number_of_taxa 1 line 2: identification string terminated by the character * The identification string may occupy more than one line, but it may not contain an embedded character * successive lines: the codenumbers and names of the taxa (at most labellength characters terminated by *) in order with the code number (an integer at most maxleaves), then the name (blanks are allowed), then * The number of these lines should match number_of_taxa on line 1. successive lines: the distances in the form a b d(a,lub(a,b)) for example 1 2 6.57 to say the distance from taxon # 1 to lca(#1,taxon # 2) is 6.57 concluding line:0 0 0 This is a dummy line to indicate the end of the input. The same a and b may appear at most maxnumbersubtrees times. It need not have the same value in each case, but the values will be averaged. Lines with a = b are ignored. Blank lines may occur in the data Note here that d(a,lub(a,b)) need not equal d(b,lub(b,a)). These are the distances from a to the most recent common ancestor of a and b; or from b to the most recent common ancestor of a and b. Only in ultrametric situations are these necessarily the same. Example of input of type 1: 6 1 No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.* 1 human * 2 chimpanzee * 7 horse * 4 whale * 5 bat * 9 cow * 1 2 1.0000000 1 7 3.0000000 1 4 3.0000000 2 1 1.0000000 2 7 3.0000000 2 4 3.0000000 7 1 2.0000000 7 2 2.0000000 7 4 1.0000000 4 1 3.0000000 4 2 3.0000000 4 7 2.0000000 1 4 3.0000000 1 5 2.0000000 1 9 3.0000000 4 1 3.0000000 4 5 3.0000000 4 9 1.0000000 5 1 1.0000000 5 4 2.0000000 5 9 2.0000000 9 1 3.0000000 9 4 1.0000000 9 5 3.0000000 2 5 1.5000000 2 9 1.0000000 5 2 2.0000000 5 9 2.0000000 9 2 3.0000000 9 5 3.5000000 0 0 0.0 (B) Input type 2: with the distances implicit: line 1: number_of_taxa 2 line 2: identification string ending in * successive lines: the codenumber and the names of the taxa (at most labellength characters terminated by *) first subtree 5 17 63 81 82 83 0 the taxa in the first rooted tree, listed by code numbers, ended by 0 5 00.213the first rooted cluster ended by 0; followed by its branchlength to its parent 5 63 0 0.316 the second rooted cluster ended by 0, followed by its branchlength to its parent ... 0 0.0 a dummy ending the input of clusters for the first tree 17 81 83 85 89 0 the taxa in the second rooted tree, ended by 0 17 0 0.333the first rooted cluster ended by 0, followed by its branchlength to parent ... 0 0.0 a dummy ending the input of clusters for the second tree ... other rooted trees 0 0.0 dummy ending of the last rooted tree 0 dummy to end the input of trees Note that singleton clusters must be included in the input of each tree. Note that the dummies should include a numerical branchlength. Blank lines can make the divisions easier. 6 2 No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.* 1 human * 2 chimpanzee * 7 horse * 4 whale * 5 bat * 9 cow * 1 2 7 4 0 1 0 1 2 0 1 7 0 1 4 0 2 1 2 0 2 7 4 0 1 0 0 1 4 5 9 0 1 0 2 4 0 1 5 0 1 9 0 1 1 5 0 1 4 9 0 2 0 0 2 5 9 0 2 0 1 5 0 2 9 0 3 2 9 0 0.5 0 0 0 In the example above, the first input tree is described as 1 2 7 4 0 1 0 1 2 0 1 7 0 1 4 0 2 1 2 0 2 7 4 0 1 0 0 This means that the leaves of the first tree were 1, 2, 7, and 4 so the root of the first input tree was the cluster (1,2, 7, 4). The cluster (1) had branchlength 1 to its parent. The cluster (2) had branchlength 1 to its parent. The cluster (7) had branchlength 1 to its parent. The cluster (4) had branchlength 2 to its parent. The cluster (1,2) had branchlength 2 to its parent. The cluster (7,4) had branchlength 1 to its parent. There were no other clusters. For example, l(2,4) = 1+2 = 3 and l(7,4) = 1. This codifies the tree (1274) / \ (12) (74) / \ / \ (1) (2) (4) (7) in which branch lengths are indicated as follows in []: (1274) / [2] \[1] (12) (74) [1]/ \[1] [2]/ \[1] (1) (2) (4) (7) (C) Input type 3 via a nexus-like description using parentheses: line 1: number_of_taxa 3 line 2: identification string ending in * successive lines: the codenumber and the names of the taxa (at most labellength characters terminated by *) number_of_subtrees being input first subtree in parenthesis notation second subtree in parenthesis notation ... last subtree in parenthesis notation The parenthesis notation for the first tree above is: (2:1, ((5:1,9:1):2,4:2):1) Numbers after colons are the branchlength to the parent. Leaves are numbers like 9. Higher clusters are in parentheses. Thus this says that the nonsingleton clusters are (2,5,9,4), (5,9), (5,9,4). The length of the edge from 5 to its parent is 1. The length of the edge from (5,9) to its parent is 2. Example: 6 3 No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.* 1 human * 2 chimpanzee * 7 horse * 4 whale * 5 bat * 9 cow * 3 ((1:1,2:1):2,(7:1,4:2):1) ((1:2,5:1):1,(4:1,9:1):2) (5:2,(2:1,9:3):0.5) SAMPLE OUTPUT: Any of these input files will result in the following output: BEGINNING OF THE OUTPUT FILE Build.output: This program implements BUILD-WITH-DISTANCES. It computes rooted supertrees by distance methods that do not require all distances to be known. Distances are given by an lca-distance function l(a,b) where l(a,b) is the length of the path from a to the lowest common ancestor of a and b. Equivalently, l(a,b) = d(a, lca(a,b)). There are several different definitions of support in the various graphs. 1. Primary. There is an edge (ab) if either there is c such that l(a,c)-l(a,b) > threshold OR l(b,c)-l(b,a) > threshold. 2. Accumulated primary. There is an edge (ab) if the sum of the positive l(a,c)-l(a,b) and l(b,c)-l(b,a) exceeds the threshold. 3. Confirmed. There is an edge (ab) if there is c such that l(a,c)-l(a,b) > threshold AND l(b,c)-l(b,a) > threshold. 4. Accumulated confirmed. There is an edge (ab) if the sum of the positive minima of (l(a,c)-l(a,b) and l(b,c)-l(b,a)) exceeds the threshold. The input number of taxa is 6 Input via nexus-type tree descriptions. The input type is 3 The identification was: No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star. Taxa: 1 human 2 chimpanzee 7 horse 4 whale 5 bat 9 cow ---------------- The null threshold tree S(0) with primary support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root The graph does not contain all singletons and is not a true tree. The null threshold tree S(0) with primary support: RMS L2 deviation: 2.30228 -------------- The minimal threshold tree Sap with accumulated primary support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root # 2: 1 2 5 0 1 # 3: 7 4 9 0 1 # 4: 1 2 0 0.5 # 5: 5 0 1 # 6: 7 0 1 # 7: 4 9 0 1 # 8: 1 0 1 # 9: 2 0 1 # 10: 4 0 1 # 11: 9 0 1 Here is a parenthetic form of the graph: (((1,2),5),(7,(4,9))) Here is a parenthetic form with edge lengths: (((1:1,2:1):0.5,5:1):1,(7:1,(4:1,9:1):1):1) The minimal threshold tree Sap with accumulated primary support: RMS L2 deviation: 0.430228 Maximal threshold used was 4.75 -------------- The minimal threshold tree Sac with accumulated confirmed support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root # 2: 1 2 5 0 1 # 3: 7 4 9 0 1 # 4: 1 2 0 0.5 # 5: 5 0 1 # 6: 7 0 1 # 7: 4 0 1 # 8: 9 0 1 # 9: 1 0 1 # 10: 2 0 1 Here is a parenthetic form of the graph: (((1,2),5),(7,4,9)) Here is a parenthetic form with edge lengths: (((1:1,2:1):0.5,5:1):1,(7:1,4:1,9:1):1) The minimal threshold tree Sac with accumulated confirmed support: RMS L2 deviation: 0.688155 Maximal threshold used was 0.25 -------------- The minimal threshold tree Sc with confirmed support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root # 2: 1 2 5 0 1 # 3: 7 4 9 0 1 # 4: 1 2 0 0.5 # 5: 5 0 1 # 6: 7 0 1 # 7: 4 0 1 # 8: 9 0 1 # 9: 1 0 1 # 10: 2 0 1 Here is a parenthetic form of the graph: (((1,2),5),(7,4,9)) Here is a parenthetic form with edge lengths: (((1:1,2:1):0.5,5:1):1,(7:1,4:1,9:1):1) The minimal threshold tree Sc with confirmed support: RMS L2 deviation: 0.688155 Maximal threshold used was 0.25 -------------- The minimal threshold tree Sp with primary support: List of clusters with root = cluster # 1: # 1: 1 2 7 4 5 9 Root # 2: 1 2 4 5 9 0 1 # 3: 7 0 1 # 4: 1 2 4 9 0 0.25 # 5: 5 0 1 # 6: 1 0 1 # 7: 2 0 1 # 8: 4 0 1 # 9: 9 0 1 Here is a parenthetic form of the graph: (((1,2,4,9),5),7) Here is a parenthetic form with edge lengths: (((1:1,2:1,4:1,9:1):0.25,5:1):1,7:1) The minimal threshold tree Sp with primary support: RMS L2 deviation: 1.26909 Maximal threshold used was 2 -------------- Summary of the L2 deviations. The tree with the smallest L2 deviation has the best fit. Sap: 0.430228 Sac: 0.688155 Sc: 0.688155 Sp: 1.26909 The best tree is Sap. END OF THE OUTPUT FILE. INTERPRETING THIS OUTPUT: (1) In the minimal threshold tree Sap with accumulated primary support: The root is (127459) There is a cluster (125) and the length of the edge to its parent is 1. There is a cluster (749) and the length of the edge to its parent is 1. There is a cluster (12) and the length of the edge to its parent is 0.5. There is a cluster (5) and the length of the edge to its parent is 1. There is a cluster (7) and the length of the edge to its parent is 1. There is a cluster (49) and the length of the edge to its parent is 1. There is a cluster (1) and the length of the edge to its parent is 1. There is a cluster (5) and the length of the edge to its parent is 1. There is a cluster (4) and the length of the edge to its parent is 1. There is a cluster (9) and the length of the edge to its parent is 1. The minimal threshold tree Sap can be written using parenthesis notation as (((1,2),5),(7,(4,9))) When distances are included, the same tree is written as (((1:1,2:1):0.5,5:1):1,(7:1,(4:1,9:1):1):1) The input implied certain lca distances l(a,b) for (a,b) in a domain D. Once the tree Sap definition was computed, new lca distances l'(a,b) could be inferred. The RMS L2 error estimate is sqrt ( sum (l(a,b)-l'(a,b))^2) where the sum is over all (a,b) in D. For this tree, the RMS L2 deviation is 0.430228 In the computation of Sap, the largest threshold that was utilized was 4.75. The other trees can be interpreted analogously. (2) At the end of the output file is a summary of the L2 deviations for all four trees. In this case Sap had the smallest L2 deviation and hence appears superior to the other trees. OBTAINING PICTURES OF THE GRAPHS: The output file Build.nex is a nexus file that can be input to PAUP*. Then PAUP* commands such as SHOW TREES will give a variety of images of the trees. For the input files given above, the file Build.nex should appear as follows: BEGINNING OF NEXUS FILE #NEXUS [Additive rooted supertrees by BUILD-WITH-DISTANCES. No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star. Sap = minimal threshold tree with accumulated primary support Sac = minimal threshold tree with accumulated confirmed support Sc = minimal threshold tree with confirmed support Sp = minimal threshold tree with primary support ] BEGIN TAXA; DIMENSIONS NTAX = 6; TAXLABELS 1_human 2_chimpanzee 7_horse 4_whale 5_bat 9_cow ; END; BEGIN TREES; TREE Sap = (((1,2),5),(3,(4,6))); TREE Sap.dis = (((1:1,2:1):0.5,5:1):1,(3:1,(4:1,6:1):1):1); TREE Sac = (((1,2),5),(3,4,6)); TREE Sac.dis = (((1:1,2:1):0.5,5:1):1,(3:1,4:1,6:1):1); TREE Sc = (((1,2),5),(3,4,6)); TREE Sc.dis = (((1:1,2:1):0.5,5:1):1,(3:1,4:1,6:1):1); TREE Sp = (((1,2,4,6),5),3); TREE Sp.dis = (((1:1,2:1,4:1,6:1):0.25,5:1):1,3:1); END; END OF NEXUS FILE Note that the third taxon is identified as 3 rather than 7 because PAUP* does not accept code number identifications of the taxa. CURRENT CONSTRAINTS: The maximum number of taxa is maxleaves = 250; The maximum number of input trees is maxtrees = 100. The maximum length of an identification string is maxidlength = 1500. The maximum length of a taxon label is labellength = 100. These may be changed in the constant declarations. The identification string should not contain the symbol ] and must terminate with the symbol * In case the program appears to be in an infinite loop, one can abort by simultaneously pressing c or possibly q If excess space is required, one can delete all references to the large array inputtriples at the cost of losing a topological test of fitness to the data. */ const int maxidlength = 1500; const int maxleaves = 145; // maximum number of taxa const int labellength = 100; const int maxedges = 2*maxleaves; const int queue = 3; //wait doing bisection before trying the lower bound ofstream outcomment, outnexus, outcodetreelist; // output files ifstream indistance; // input data file int inputtype; // {1 for distances, 2 for treedash format} int idlength; //{length of id string} char idstring[maxidlength];//identification string boolean inputdone, treedone; int treecount; // {the number of input trees} int inclustercount;//{the number of clusters in the ith tree} float branchlength[maxedges];//{branchlength to the parent} boolean inclusterdone[maxedges]; int taxonidnumber[maxleaves]; // {the id number // this is of use in preparing distances for various subtrees separately} // reversetaxonidnumber: array[1..maxleaves] of integer; {the actualleaf // identification, so reversetaxonidnumber[taxonidnumber[i]] = i int reversetaxonidnumber[maxleaves] ; // {the actualleaf // identification, so reversetaxonidnumber[taxonidnumber[i]] = i int actualleaves; // number of taxa char labels[maxleaves][labellength]; int actuallabellength; int counter[maxleaves]; // {the counts of indices} float distance[maxleaves][maxleaves]; // {distance[a,b] is the ith value of a distance between a and lub(a,b); // this is the generic distance as input} int distancecount[maxleaves][maxleaves]; // {the number of different inputs of distance[a,b]} boolean inputtriples[maxedges][maxedges][maxedges]; int clustercount; float clusterlength[maxedges]; boolean clusterdone[maxedges]; // {cluster[i] is the ith edge of the tree being built // and tells the leaves on one side of that edge. // Each cluster is a subset of clusterfullset. // cluster[i] is logically equivalent to clusterfullset-cluster[i]. // The tree has clustercount different edges. // Edge i has length clusterlength[i]. } boolean clusterchild[maxedges][maxedges]; // {clusterchild[i,j] is true if cluster i is a child of cluster j} int clusterroot; //{the id of the cluster that is the root of the cluster graph} int componentarray[maxleaves]; // the list of vertices in a component of the Aho graph int vertexcount; int vertex[maxleaves]; // {the contents of the vertex of the Aho graph} boolean vertexadjacent[maxleaves][maxleaves]; // {true if they are adjacent vertices} boolean done; char ch; float tempreal; int i,j,k; boolean changed; double l2dev; int count; int best, besttriplet; float bestvalue, besttripletvalue; float minimumused; //{the minimum difference in the Aho graph} float maximumused; // {the maximum support in the Aho graph} float tmaximal; // {the maximum t used to build the local tree} float support[maxleaves][maxleaves]; int edgetype; // {edgetype = 1 if there is a weak criterion for an edge in the Aho graph; // it is 2 if there is a strict criterion using a,b,c; and b,a,d; // if it 3 if there is a very strict criterion using a,b,c; and b,a,c} int tempint; char treeleaves[maxleaves]; /*treeleaves[j][i] = 1 if i is in the set of leaves of the jth input tree*/ char incluster[maxedges][maxleaves]; /*cluster[i,j,k] is 1 if k is in the jth cluster of tree i*/ char cluster[maxedges][maxleaves]; float summary[6]; int summaryinputtriplefailure[6], summaryclustercount[6]; float minimum, epsilon; float inputtripletotal; int inputtriplefailure; void finishup(void); void finishup(void) { cout<<"Program has terminated."<>ch; exit(0); } void adddistance(int first, int second, float dis); void adddistance(int first, int second, float dis) { distancecount[first][second] = distancecount[first][second]+1; if (distancecount[first][second] == 1) distance[first][second] = dis; else distance[first][second] = (distance[first][second]*(distancecount[first][second]-1.0) + dis) / distancecount[first][second]; } int digit(char ); int digit(char ch) { if (ch == '1') return 1; else if (ch == '2') return 2; else if (ch == '3') return 3; else if (ch == '4') return 4; else if (ch == '5') return 5; else if (ch == '6') return 6; else if (ch == '7') return 7; else if (ch == '8') return 8; else if (ch == '9') return 9; else if (ch == '0') return 0; else return -1; } float computedindistance(int, int); float computedindistance(int first, int second) //{this computes the distance from first to lca(first,second) in the current input tree} { int i; float currentvalue; currentvalue = 0.0; // cout<<"Computing distance in tree "<>ch; } indistance>>actualleaves; cout<<"The input number of taxa is "<>inputtype; if (inputtype == 1) outcomment<<"Input via a list of lca distances of form: a b d(a,lub(a,b)) readings."<= maxidlength) done = true; else { idlength = idlength + 1; idstring[idlength]=ch; } } if ((actualleaves > maxleaves) || (actualleaves < 0)) { outcomment<<"Wrong number of species"<>k1; counter[i1] = 0; done = false; taxonidnumber[i1] = k1; reversetaxonidnumber[k1] = i1; counter[i1] = 1; // {read in the labels, at most labellength characters terminated by *} while (! done) { indistance>>ch; if (ch != '*') { labels[i1][counter[i1]] = ch; counter[i1] = counter[i1] + 1; if (counter[i1] >labellength) { done = true; outcomment<<"Label length "< actuallabellength) actuallabellength = counter[i1]; } } // {while not done do} } outcomment<<"Taxa:"<>ch; for (i1 = 1; i1<=actualleaves;i1++) for (j1 = 1; j1<=actualleaves;j1++) { distancecount[i1][j1] = 0; distance[i1][j1] = 0.0; } if (inputtype == 1) { // {read in the distance matrix} cout<<"Starting to read the distances."<>first>>second>>tempreal; if (first == 0) done = true; else { if (first > maxleaves) { outcomment<<"Illegal taxon "< maxleaves) { outcomment<<"Illegal taxon "<>ch; */ } } // {while} // echo distances /* for(i=1; i<=actualleaves; i++) for(j=1; j<=actualleaves; j++) { if (distancecount[i][j]>0) outcomment<>tempint; if (tempint == 0) done = true; else if (tempint > 0) if (tempint <= maxleaves) if (reversetaxonidnumber[tempint]>0) { treeleaves[reversetaxonidnumber[tempint]]=1; } } /* cout<<"Taxa in tree: "; for (j = 1; j<=actualleaves;j++) if ( treeleaves[treecount][j]==1) cout<>ch; */ /* outcomment<<"Taxa in tree "<>ch; */ } while (! treedone) // {read a new cluster until we get an empty cluster} { inclustercount = inclustercount+1; for (j=1; j<=actualleaves;j++) incluster[inclustercount][j]=0; done = false; while (! done) { temp1 = inclustercount; indistance>>tempint; if (tempint == 0) done = true; else if (tempint > 0) if (tempint <= maxleaves) { if (reversetaxonidnumber[tempint]>0) incluster[ temp1][reversetaxonidnumber[tempint]]=1; else { outcomment<<"Unidentified taxon "<>branchlength[temp1]; /* cout<<"Cluster "<>ch; */ // {check whether the cluster is empty} treedone = true; for (j = 1; j<=actualleaves;j++) if (incluster[inclustercount][j] ==1) treedone = false; if (treedone) inclustercount = inclustercount-1; } // {while not treedone} /* cout<<" Tree completed. "; cout<>ch; */ // {find the distances that can be inferred, and move them into the matrix distance} if (! inputdone) { // cout<<"Inferred distances in tree # "<>ch; */ } } // {if not inputdone} if (! inputdone) adjointriples(); } // {while not inputdone} outcomment<>allegedtreecount; cout<<"Read "<>ch; if (ch == '(') { cout<<'('; treeclustercount = treeclustercount + 1; clusterlevel = clusterlevel+1; openclusters[clusterlevel]=treeclustercount; numbermode = false; } else if (ch == ':') { if (numbermode) { cout<maxleaves) {outcomment<<"Bad taxon "<>length; branchlength[treeclustercount] = length; cout<>length; branchlength[openclusters[clusterlevel]] = length; numbermode = false; clusterlevel = clusterlevel-1; cout<<':'<-1) { if (numbermode) taxon = taxon*10+digit(ch); else { numbermode = true; taxon = digit(ch); } } } // {while not done} cout<0) outcomment<1) outcomment<<" 0 "<0) if (distancecount[vertex[i]][vertex[k]] > 0) { tempreal = distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]]; if (tempreal > support[i][j]) if (tempreal > epsilon) { support[i][j] = tempreal; support[j][i] = tempreal; } } } void findsupportprimarysum(int clusterid); void findsupportprimarysum(int clusterid) { int i,j,k; float tempreal; for (i = 1; i<= actualleaves; i++) for (j = 1; j<= actualleaves; j++) support[i][j] = 0.0; vertexcount = 0; for (i = 1; i<= actualleaves; i++) if (cluster[clusterid][i]==1) { vertexcount = vertexcount + 1; vertex[vertexcount] = i; } for (i = 1; i<= vertexcount; i++) for (j = i+1; j<= vertexcount; j++) { tempreal = 0.0; for (k = 1; k<= vertexcount; k++) if (i != k) if (j != k) { if (distancecount[vertex[i]][vertex[j]]>0) if (distancecount[vertex[i]][vertex[k]] > 0) if (distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]] > 0.0) tempreal = tempreal + distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]]; if (distancecount[vertex[j]][vertex[i]]>0) if (distancecount[vertex[j]][vertex[k]] > 0) if (distance[vertex[j]][vertex[k]] -distance[vertex[j]][vertex[i]] > 0.0) tempreal = tempreal + distance[vertex[j]][vertex[k]] -distance[vertex[j]][vertex[i]]; } if (tempreal > epsilon) support[i][j] = tempreal; if (tempreal > epsilon) support[j][i] = tempreal; } } void findsupportconfirmedsum(int clusterid); void findsupportconfirmedsum(int clusterid) { int i,j,k; float tempreal, tempreal2; for (i = 1; i<= actualleaves; i++) for (j = 1; j<= actualleaves; j++) support[i][j] = 0.0; vertexcount = 0; for (i = 1; i<= actualleaves; i++) if (cluster[clusterid][i]==1) { vertexcount = vertexcount + 1; vertex[vertexcount] = i; } for (i = 1; i<= vertexcount; i++) for (j = i+1; j<= vertexcount; j++) { for (k = 1; k<= vertexcount; k++) if (i != k) if (j != k) if (distancecount[vertex[i]][vertex[j]]>0) if (distancecount[vertex[i]][vertex[k]] > 0) if (distancecount[vertex[j]][vertex[i]]>0) if (distancecount[vertex[j]][vertex[k]] > 0) { tempreal = distance[vertex[j]][vertex[k]] -distance[vertex[j]][vertex[i]]; tempreal2 = distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]]; if (tempreal2 < tempreal) tempreal = tempreal2; if (tempreal > 0.0) { support[i][j] = support[i][j]+tempreal; support[j][i] = support[i][j]; } } if (support[i][j] < epsilon) support[i][j] = 0.0; if (support[j][i] < epsilon) support[j][i] = 0.0; } } void findsupportconfirmed(int clusterid); void findsupportconfirmed(int clusterid) { int i,j,k; float tempreal, tempreal2; for (i = 1; i<= actualleaves; i++) for (j = 1; j<= actualleaves; j++) support[i][j] = 0.0; vertexcount = 0; for (i = 1; i<= actualleaves; i++) if (cluster[clusterid][i]==1) { vertexcount = vertexcount + 1; vertex[vertexcount] = i; } for (i = 1; i<= vertexcount; i++) for (j = i+1; j<= vertexcount; j++) { for (k = 1; k<= vertexcount; k++) if (i != k) if (j != k) if (distancecount[vertex[i]][vertex[j]]>0) if (distancecount[vertex[i]][vertex[k]] > 0) if (distancecount[vertex[j]][vertex[i]]>0) if (distancecount[vertex[j]][vertex[k]] > 0) { tempreal = distance[vertex[j]][vertex[k]] -distance[vertex[j]][vertex[i]]; tempreal2 = distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]]; if (tempreal2 < tempreal) tempreal = tempreal2; if (tempreal > support[i][j]) if (tempreal > epsilon) { support[i][j] = tempreal; support[j][i] = support[i][j]; } } } } void findahographsupported(float t); void findahographsupported(float t) { int i,j; minimumused = 3200000.0; maximumused = 0.0; for (i = 1; i<= vertexcount; i++) for (j = 1; j<= vertexcount; j++) vertexadjacent[i][j] = false; for (i = 1; i<= vertexcount; i++) for (j = i+1 ; j<= vertexcount; j++) if (support[i][j] > t) { vertexadjacent[i][j] = true; vertexadjacent[j][i] = true; if (support[i][j] < minimumused) minimumused = support[i][j]; if (support[i][j] > maximumused) maximumused = support[i][j]; } if (minimumused == 3200000.0) minimumused = 0.0; } void findclusterssupported(int clusterid); void findclusterssupported(int clusterid) { //{This finds the new clusters which are the children of cluster[clusterid]. //The Aho graph is drawn with vertices the members of cluster[clusterid]. //The new clusters will be the components of the graph. //the graph has vertexcount vertices; // vertex[i] is the leaf in the ith vertex // Vertices i and j are adjacent iff vertexadjacent[i,j] is true. int taxoncount; int i,j,k,l; int componentsfound[maxleaves], vertexcomponent[maxleaves];//: splits; float distanceup; taxoncount = 0; for (i= 1;i<=actualleaves;i++) if (cluster[clusterid][i]==1) taxoncount = taxoncount + 1; if (taxoncount > 1) { findsupportprimary(clusterid); findahographsupported(0.0); // if (edgetype == 1) findahographprimary(clusterid,0.0); // else if (edgetype == 3) findahographconfirmed(clusterid,0.0); for (i=1;i<=actualleaves; i++) componentsfound[i]=0; for (i = 1; i<=vertexcount;i++) if( componentsfound[i]==0) { component(i); for (k=1;k<=actualleaves;k++) vertexcomponent[k] = componentarray[k]; // outcomment<<"Component of "<0) { if (distance[vertex[j]][vertex[k]]0) if (distance[vertex[j]][vertex[l]]1} clusterdone[clusterid] = true; }// {findclusterssupported} void findclustersminimaldisconnectsupported(int clusterid); void findclustersminimaldisconnectsupported(int clusterid) { /*{This finds the new clusters which are the children of cluster[clusterid]. The Aho graph is drawn with vertices the members of cluster[clusterid]. The new clusters will be the components of the graph. the graph has vertexcount vertices; vertex[i] is the leaf in the ith vertex Vertices i and j are adjacent iff vertexadjacent[i,j] is true. The minimal disconnecting threshold is found by a bisection process: an interval containing the threshold is [boundingmin, boundingmax]. When t = boundingmin, the cluster stays connected. When t = boundingmax, the cluster disconnects. }*/ int taxoncount; int i,j,k,l; int componentsfound[maxleaves], vertexcomponent[maxleaves];//: splits; float distanceup; float t; boolean done, disconnected; float boundingmin, boundingmax; // {interval containing t} float previoust; int boundingmincount; float oldboundingmin; taxoncount = 0; for (i= 1;i<=actualleaves;i++) if (cluster[clusterid][i]==1) taxoncount = taxoncount + 1; if (taxoncount > 1) { t = 0.0; boundingmin = 0.0; boundingmax = 3200000.0; previoust = -1000.0; boundingmincount = 0; if (edgetype == 1) findsupportprimary(clusterid); else if (edgetype == 3) findsupportconfirmed(clusterid); else if (edgetype == 5) findsupportprimarysum(clusterid); else if (edgetype == 7) findsupportconfirmedsum(clusterid); done = false; while (! done) { //if (edgetype = 2 then findahographstrongedge(clusterid,t) //else findahographsupported(t); component(1); for (j=1; j<=vertexcount; j++) vertexcomponent[j] = componentarray[j]; count = 0; for (j = 1;j<=vertexcount;j++) if ( vertexcomponent[j]==1) count = count + 1; if (count < taxoncount) { boundingmax = t; disconnected = true; } else { boundingmin = minimumused; disconnected = false; } if (boundingmax == 3200000.0) boundingmax = maximumused; if (boundingmin == oldboundingmin) boundingmincount = boundingmincount + 1; else boundingmincount = 0; oldboundingmin = boundingmin; if (boundingmin >= boundingmax) if (disconnected) done = true; if (! done) { if (boundingmincount >= queue) t = boundingmin; else t = (boundingmin+boundingmax)/2.0; if (boundingmin>boundingmax) { done = true; } if (t == previoust) done = true; else previoust = t; } } //{while} if (! disconnected) { t = boundingmax; findahographsupported(t); component(1); for (j=1; j<=vertexcount; j++) vertexcomponent[j] = componentarray[j]; count = 0; for (j = 1;j<=vertexcount;j++) if ( vertexcomponent[j]==1) count = count + 1; if (count < taxoncount) { boundingmax = t; disconnected = true; } else { boundingmin = minimumused; disconnected = false; } if (! disconnected) { outcomment<<"Error in procedure: Cluster "<>ch; } } // {adjoin proper components, now that the graph is disconnected} // outcomment<<" For cluster "< tmaximal ) tmaximal = t; for (i = 1;i<=vertexcount;i++) componentsfound[i]=0; for (i = 1;i<=vertexcount;i++) if(componentsfound[i]==0) { component(i); for(j=1;j<=vertexcount;j++) vertexcomponent[j] = componentarray[j]; for(j=1;j<=vertexcount;j++) if (vertexcomponent[j]==1) componentsfound[j]=1; clustercount = clustercount + 1; for(j=1;j<=actualleaves;j++) cluster[clustercount][j]=0; clusterdone[clustercount] = false; for(j=1;j<=vertexcount;j++) if (vertexcomponent[j]==1) cluster[clustercount][vertex[j]]=1; // outcomment<<"cluster "<0) { if (distance[vertex[j]][vertex[k]]0) if (distance[vertex[j]][vertex[l]]1} clusterdone[clusterid] = true; } // {findclustersminimaldisconnectsupported} float clusterlubdis(int a, int b); float clusterlubdis(int a, int b) { /*{this computes the distance between a and lub(a,b) in cluster; we find all clusters that contain a but not b} */ int i; float currentdis; currentdis = 0.0; for (i = 1; i<=clustercount;i++) if (cluster[i][a]==1) if(cluster[i][b]==0) currentdis = currentdis + clusterlength[i]; return(currentdis); } // {clusterlubdis} boolean abslashc(int a, int b,int c); boolean abslashc(int a, int b,int c) {//this function returns true if ab|c in the cluster} int i; boolean supportfound; supportfound = false; for (i = 1; i<=clustercount; i++) if (! supportfound ) if(cluster[i][a]==1) if ( cluster[i][b]==1) if (cluster[i][c]==0) supportfound = true; return supportfound; } // {abslashc} void compareclusterwithinput(void); void compareclusterwithinput(void) { int i,j,k; // outcomment<<"\nComparison of computed distances with input distances.\n"; // outcomment<<" a b input d(a,lub(a,b)) computed d(a,lub(a,b))\n"; l2dev = 0.0; count = 0; for (i= 1;i<=actualleaves;i++) for (j= 1;j<=actualleaves;j++) if (i!= j) if (distancecount[i][j] > 0 ) { count = count + 1; tempreal = clusterlubdis(i,j); l2dev = l2dev +(distance[i][j]-tempreal)*(distance[i][j]-tempreal); // if (actualleaves < 15) outcomment<= 2) { inputtriplefailure = 0; inputtripletotal = 0; for (i = 1; i<= actualleaves; i++) for (j = i+1; j<=actualleaves; j++) for (k = 1; k<=actualleaves; k++) if (i!=k) if (j!=k) { inputtripletotal = inputtripletotal + 1; if (inputtriples[i][j][k]) if (! abslashc(i,j,k)) inputtriplefailure = inputtriplefailure + 1; } outcomment<<"Count of topological input triples not in the result is "< 1) if (countpossible < actualleaves) { for (j= 1; j<=actualleaves;j++) if (cluster[i][j]==1) outcodetreelist<0) { count = count + 1; total = total + distance[i][j]; if (distance[i][j]< minimum) minimum = distance[i][j]; } total = total/count; epsilon = total/10000.0; outcomment<<"Suggested epsilon is "<< epsilon< threshold OR l(b,c)-l(b,a) > threshold."< threshold AND l(b,d)-l(b,a) > threshold.');} outcomment<<"2. Accumulated primary. There is an edge (ab) if the sum of the positive "< threshold AND l(b,c)-l(b,a) > threshold."<= 2) { outcomment<<"Input triplets"<= 2) { outcomment<=2) { besttriplet = 2; besttripletvalue = summaryinputtriplefailure[2]; if (summaryinputtriplefailure[3]