#include
#include
#include
#include
#define boolean int
/*This program was written by
Stephen J. Willson
Department of Mathematics
Iowa State University
Ames, IA 50011 USA
swillson@iastate.edu
December 2003
This program implements the algorithm BUILD-WITH-DISTANCES.
It is given distance information in the file Build.input.
It attempts to create a tree S which is additive with these distances.
Not all a and b need to have l(a,b) given. Hence a natural
way for such an input file to arise is from distances in various
trees, when the desired tree is a supertree of the input trees.
The input file is Build.input
The output files are
(1) Build.output
This contains summary of all the results.
(2) Build.outcode
This contains summary information about the trees in a format that can be used for input.
(3) Build.nex
This nexus file can be used with PAUP* to draw the trees which have been generated.
Format of Build.input:
There are three types of input file allowed:
(A) Input type 1: with the distances explicit:
line 1: number_of_taxa 1
line 2: identification string terminated by the character *
The identification string may occupy more than one line, but it may
not contain an embedded character *
successive lines: the codenumbers and names of the taxa (at most labellength
characters terminated by *) in order
with the code number (an integer at most maxleaves), then the name (blanks are allowed), then *
The number of these lines should match number_of_taxa on line 1.
successive lines: the distances in the form
a b d(a,lub(a,b))
for example 1 2 6.57 to say the distance from taxon # 1 to lca(#1,taxon # 2) is 6.57
concluding line:0 0 0
This is a dummy line to indicate the end of the input.
The same a and b may appear at most maxnumbersubtrees times.
It need not have the same value in each case, but the values will be averaged.
Lines with a = b are ignored.
Blank lines may occur in the data
Note here that d(a,lub(a,b)) need not equal d(b,lub(b,a)).
These are the distances from a to the most recent common ancestor of a and b;
or from b to the most recent common ancestor of a and b.
Only in ultrametric situations are these necessarily the same.
Example of input of type 1:
6 1
No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.*
1 human *
2 chimpanzee *
7 horse *
4 whale *
5 bat *
9 cow *
1 2 1.0000000
1 7 3.0000000
1 4 3.0000000
2 1 1.0000000
2 7 3.0000000
2 4 3.0000000
7 1 2.0000000
7 2 2.0000000
7 4 1.0000000
4 1 3.0000000
4 2 3.0000000
4 7 2.0000000
1 4 3.0000000
1 5 2.0000000
1 9 3.0000000
4 1 3.0000000
4 5 3.0000000
4 9 1.0000000
5 1 1.0000000
5 4 2.0000000
5 9 2.0000000
9 1 3.0000000
9 4 1.0000000
9 5 3.0000000
2 5 1.5000000
2 9 1.0000000
5 2 2.0000000
5 9 2.0000000
9 2 3.0000000
9 5 3.5000000
0 0 0.0
(B) Input type 2: with the distances implicit:
line 1: number_of_taxa 2
line 2: identification string ending in *
successive lines: the codenumber and the names of the taxa (at most labellength characters terminated by *)
first subtree
5 17 63 81 82 83 0 the taxa in the first rooted tree, listed by code numbers, ended by 0
5 00.213the first rooted cluster ended by 0; followed by its branchlength to its parent
5 63 0 0.316 the second rooted cluster ended by 0, followed by its branchlength to its parent
...
0 0.0 a dummy ending the input of clusters for the first tree
17 81 83 85 89 0 the taxa in the second rooted tree, ended by 0
17 0 0.333the first rooted cluster ended by 0, followed by its branchlength to parent
...
0 0.0 a dummy ending the input of clusters for the second tree
... other rooted trees
0 0.0 dummy ending of the last rooted tree
0 dummy to end the input of trees
Note that singleton clusters must be included in the input of each tree.
Note that the dummies should include a numerical branchlength.
Blank lines can make the divisions easier.
6 2
No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.*
1 human *
2 chimpanzee *
7 horse *
4 whale *
5 bat *
9 cow *
1 2 7 4 0
1 0 1
2 0 1
7 0 1
4 0 2
1 2 0 2
7 4 0 1
0 0
1 4 5 9 0
1 0 2
4 0 1
5 0 1
9 0 1
1 5 0 1
4 9 0 2
0 0
2 5 9 0
2 0 1
5 0 2
9 0 3
2 9 0 0.5
0 0
0
In the example above, the first input tree is described as
1 2 7 4 0
1 0 1
2 0 1
7 0 1
4 0 2
1 2 0 2
7 4 0 1
0 0
This means that the leaves of the first tree were 1, 2, 7, and 4
so the root of the first input tree was the cluster (1,2, 7, 4).
The cluster (1) had branchlength 1 to its parent.
The cluster (2) had branchlength 1 to its parent.
The cluster (7) had branchlength 1 to its parent.
The cluster (4) had branchlength 2 to its parent.
The cluster (1,2) had branchlength 2 to its parent.
The cluster (7,4) had branchlength 1 to its parent.
There were no other clusters.
For example, l(2,4) = 1+2 = 3 and l(7,4) = 1.
This codifies the tree
(1274)
/ \
(12) (74)
/ \ / \
(1) (2) (4) (7)
in which branch lengths are indicated as follows in []:
(1274)
/ [2] \[1]
(12) (74)
[1]/ \[1] [2]/ \[1]
(1) (2) (4) (7)
(C) Input type 3 via a nexus-like description using parentheses:
line 1: number_of_taxa 3
line 2: identification string ending in *
successive lines: the codenumber and the names of the taxa (at most labellength characters terminated by *)
number_of_subtrees being input
first subtree in parenthesis notation
second subtree in parenthesis notation
...
last subtree in parenthesis notation
The parenthesis notation for the first tree above is:
(2:1, ((5:1,9:1):2,4:2):1)
Numbers after colons are the branchlength to the parent.
Leaves are numbers like 9.
Higher clusters are in parentheses.
Thus this says that the nonsingleton clusters are (2,5,9,4), (5,9), (5,9,4).
The length of the edge from 5 to its parent is 1.
The length of the edge from (5,9) to its parent is 2.
Example:
6 3
No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.*
1 human *
2 chimpanzee *
7 horse *
4 whale *
5 bat *
9 cow *
3
((1:1,2:1):2,(7:1,4:2):1)
((1:2,5:1):1,(4:1,9:1):2)
(5:2,(2:1,9:3):0.5)
SAMPLE OUTPUT:
Any of these input files will result in the following output:
BEGINNING OF THE OUTPUT FILE Build.output:
This program implements BUILD-WITH-DISTANCES.
It computes rooted supertrees by distance methods
that do not require all distances to be known.
Distances are given by an lca-distance function l(a,b)
where l(a,b) is the length of the path from a to the lowest common ancestor of a and b.
Equivalently, l(a,b) = d(a, lca(a,b)).
There are several different definitions of support in the various graphs.
1. Primary. There is an edge (ab) if either there is c such that
l(a,c)-l(a,b) > threshold OR l(b,c)-l(b,a) > threshold.
2. Accumulated primary. There is an edge (ab) if the sum of the positive
l(a,c)-l(a,b) and l(b,c)-l(b,a) exceeds the threshold.
3. Confirmed. There is an edge (ab) if there is c such that
l(a,c)-l(a,b) > threshold AND l(b,c)-l(b,a) > threshold.
4. Accumulated confirmed. There is an edge (ab) if the sum of the positive
minima of (l(a,c)-l(a,b) and l(b,c)-l(b,a)) exceeds the threshold.
The input number of taxa is 6
Input via nexus-type tree descriptions.
The input type is 3
The identification was:
No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.
Taxa:
1 human
2 chimpanzee
7 horse
4 whale
5 bat
9 cow
----------------
The null threshold tree S(0) with primary support:
List of clusters with root = cluster # 1:
# 1: 1 2 7 4 5 9 Root
The graph does not contain all singletons and is not a true tree.
The null threshold tree S(0) with primary support:
RMS L2 deviation: 2.30228
--------------
The minimal threshold tree Sap with accumulated primary support:
List of clusters with root = cluster # 1:
# 1: 1 2 7 4 5 9 Root
# 2: 1 2 5 0 1
# 3: 7 4 9 0 1
# 4: 1 2 0 0.5
# 5: 5 0 1
# 6: 7 0 1
# 7: 4 9 0 1
# 8: 1 0 1
# 9: 2 0 1
# 10: 4 0 1
# 11: 9 0 1
Here is a parenthetic form of the graph:
(((1,2),5),(7,(4,9)))
Here is a parenthetic form with edge lengths:
(((1:1,2:1):0.5,5:1):1,(7:1,(4:1,9:1):1):1)
The minimal threshold tree Sap with accumulated primary support:
RMS L2 deviation: 0.430228
Maximal threshold used was 4.75
--------------
The minimal threshold tree Sac with accumulated confirmed support:
List of clusters with root = cluster # 1:
# 1: 1 2 7 4 5 9 Root
# 2: 1 2 5 0 1
# 3: 7 4 9 0 1
# 4: 1 2 0 0.5
# 5: 5 0 1
# 6: 7 0 1
# 7: 4 0 1
# 8: 9 0 1
# 9: 1 0 1
# 10: 2 0 1
Here is a parenthetic form of the graph:
(((1,2),5),(7,4,9))
Here is a parenthetic form with edge lengths:
(((1:1,2:1):0.5,5:1):1,(7:1,4:1,9:1):1)
The minimal threshold tree Sac with accumulated confirmed support:
RMS L2 deviation: 0.688155
Maximal threshold used was 0.25
--------------
The minimal threshold tree Sc with confirmed support:
List of clusters with root = cluster # 1:
# 1: 1 2 7 4 5 9 Root
# 2: 1 2 5 0 1
# 3: 7 4 9 0 1
# 4: 1 2 0 0.5
# 5: 5 0 1
# 6: 7 0 1
# 7: 4 0 1
# 8: 9 0 1
# 9: 1 0 1
# 10: 2 0 1
Here is a parenthetic form of the graph:
(((1,2),5),(7,4,9))
Here is a parenthetic form with edge lengths:
(((1:1,2:1):0.5,5:1):1,(7:1,4:1,9:1):1)
The minimal threshold tree Sc with confirmed support:
RMS L2 deviation: 0.688155
Maximal threshold used was 0.25
--------------
The minimal threshold tree Sp with primary support:
List of clusters with root = cluster # 1:
# 1: 1 2 7 4 5 9 Root
# 2: 1 2 4 5 9 0 1
# 3: 7 0 1
# 4: 1 2 4 9 0 0.25
# 5: 5 0 1
# 6: 1 0 1
# 7: 2 0 1
# 8: 4 0 1
# 9: 9 0 1
Here is a parenthetic form of the graph:
(((1,2,4,9),5),7)
Here is a parenthetic form with edge lengths:
(((1:1,2:1,4:1,9:1):0.25,5:1):1,7:1)
The minimal threshold tree Sp with primary support:
RMS L2 deviation: 1.26909
Maximal threshold used was 2
--------------
Summary of the L2 deviations.
The tree with the smallest L2 deviation has the best fit.
Sap: 0.430228
Sac: 0.688155
Sc: 0.688155
Sp: 1.26909
The best tree is Sap.
END OF THE OUTPUT FILE.
INTERPRETING THIS OUTPUT:
(1) In the minimal threshold tree Sap with accumulated primary support:
The root is (127459)
There is a cluster (125) and the length of the edge to its parent is 1.
There is a cluster (749) and the length of the edge to its parent is 1.
There is a cluster (12) and the length of the edge to its parent is 0.5.
There is a cluster (5) and the length of the edge to its parent is 1.
There is a cluster (7) and the length of the edge to its parent is 1.
There is a cluster (49) and the length of the edge to its parent is 1.
There is a cluster (1) and the length of the edge to its parent is 1.
There is a cluster (5) and the length of the edge to its parent is 1.
There is a cluster (4) and the length of the edge to its parent is 1.
There is a cluster (9) and the length of the edge to its parent is 1.
The minimal threshold tree Sap can be written using parenthesis notation as
(((1,2),5),(7,(4,9)))
When distances are included, the same tree is written as
(((1:1,2:1):0.5,5:1):1,(7:1,(4:1,9:1):1):1)
The input implied certain lca distances l(a,b) for (a,b) in a domain D.
Once the tree Sap definition was computed,
new lca distances l'(a,b) could be inferred.
The RMS L2 error estimate is
sqrt ( sum (l(a,b)-l'(a,b))^2)
where the sum is over all (a,b) in D.
For this tree, the RMS L2 deviation is 0.430228
In the computation of Sap, the largest threshold that was utilized was 4.75.
The other trees can be interpreted analogously.
(2) At the end of the output file is a summary
of the L2 deviations for all four trees. In this case
Sap had the smallest L2 deviation and hence
appears superior to the other trees.
OBTAINING PICTURES OF THE GRAPHS:
The output file Build.nex is a nexus file that can be input
to PAUP*. Then PAUP* commands such as SHOW TREES will give a variety
of images of the trees.
For the input files given above, the file Build.nex should appear
as follows:
BEGINNING OF NEXUS FILE
#NEXUS
[Additive rooted supertrees by BUILD-WITH-DISTANCES.
No supertree ((12)(74)), ((15)(49)), (5(29)). Sap best, then Sc=Sac, then Sp. MRP = star.
Sap = minimal threshold tree with accumulated primary support
Sac = minimal threshold tree with accumulated confirmed support
Sc = minimal threshold tree with confirmed support
Sp = minimal threshold tree with primary support
]
BEGIN TAXA;
DIMENSIONS NTAX = 6;
TAXLABELS
1_human
2_chimpanzee
7_horse
4_whale
5_bat
9_cow
;
END;
BEGIN TREES;
TREE Sap = (((1,2),5),(3,(4,6)));
TREE Sap.dis = (((1:1,2:1):0.5,5:1):1,(3:1,(4:1,6:1):1):1);
TREE Sac = (((1,2),5),(3,4,6));
TREE Sac.dis = (((1:1,2:1):0.5,5:1):1,(3:1,4:1,6:1):1);
TREE Sc = (((1,2),5),(3,4,6));
TREE Sc.dis = (((1:1,2:1):0.5,5:1):1,(3:1,4:1,6:1):1);
TREE Sp = (((1,2,4,6),5),3);
TREE Sp.dis = (((1:1,2:1,4:1,6:1):0.25,5:1):1,3:1);
END;
END OF NEXUS FILE
Note that the third taxon is identified as 3 rather than 7
because PAUP* does not accept code number identifications of the taxa.
CURRENT CONSTRAINTS:
The maximum number of taxa is maxleaves = 250;
The maximum number of input trees is maxtrees = 100.
The maximum length of an identification string is maxidlength = 1500.
The maximum length of a taxon label is labellength = 100.
These may be changed in the constant declarations.
The identification string should not contain the symbol ]
and must terminate with the symbol *
In case the program appears to be in an infinite loop, one can abort
by simultaneously pressing
c
or possibly q
If excess space is required, one can delete all references
to the large array inputtriples
at the cost of losing a topological test of fitness to the data.
*/
const int maxidlength = 1500;
const int maxleaves = 145; // maximum number of taxa
const int labellength = 100;
const int maxedges = 2*maxleaves;
const int queue = 3; //wait doing bisection before trying the lower bound
ofstream outcomment, outnexus, outcodetreelist; // output files
ifstream indistance; // input data file
int inputtype; // {1 for distances, 2 for treedash format}
int idlength; //{length of id string}
char idstring[maxidlength];//identification string
boolean inputdone, treedone;
int treecount; // {the number of input trees}
int inclustercount;//{the number of clusters in the ith tree}
float branchlength[maxedges];//{branchlength to the parent}
boolean inclusterdone[maxedges];
int taxonidnumber[maxleaves]; // {the id number
// this is of use in preparing distances for various subtrees separately}
// reversetaxonidnumber: array[1..maxleaves] of integer; {the actualleaf
// identification, so reversetaxonidnumber[taxonidnumber[i]] = i
int reversetaxonidnumber[maxleaves] ; // {the actualleaf
// identification, so reversetaxonidnumber[taxonidnumber[i]] = i
int actualleaves; // number of taxa
char labels[maxleaves][labellength];
int actuallabellength;
int counter[maxleaves]; // {the counts of indices}
float distance[maxleaves][maxleaves];
// {distance[a,b] is the ith value of a distance between a and lub(a,b);
// this is the generic distance as input}
int distancecount[maxleaves][maxleaves];
// {the number of different inputs of distance[a,b]}
boolean inputtriples[maxedges][maxedges][maxedges];
int clustercount;
float clusterlength[maxedges];
boolean clusterdone[maxedges];
// {cluster[i] is the ith edge of the tree being built
// and tells the leaves on one side of that edge.
// Each cluster is a subset of clusterfullset.
// cluster[i] is logically equivalent to clusterfullset-cluster[i].
// The tree has clustercount different edges.
// Edge i has length clusterlength[i]. }
boolean clusterchild[maxedges][maxedges];
// {clusterchild[i,j] is true if cluster i is a child of cluster j}
int clusterroot; //{the id of the cluster that is the root of the cluster graph}
int componentarray[maxleaves]; // the list of vertices in a component of the Aho graph
int vertexcount;
int vertex[maxleaves]; // {the contents of the vertex of the Aho graph}
boolean vertexadjacent[maxleaves][maxleaves];
// {true if they are adjacent vertices}
boolean done;
char ch;
float tempreal;
int i,j,k;
boolean changed;
double l2dev;
int count;
int best, besttriplet;
float bestvalue, besttripletvalue;
float minimumused; //{the minimum difference in the Aho graph}
float maximumused; // {the maximum support in the Aho graph}
float tmaximal; // {the maximum t used to build the local tree}
float support[maxleaves][maxleaves];
int edgetype;
// {edgetype = 1 if there is a weak criterion for an edge in the Aho graph;
// it is 2 if there is a strict criterion using a,b,c; and b,a,d;
// if it 3 if there is a very strict criterion using a,b,c; and b,a,c}
int tempint;
char treeleaves[maxleaves];
/*treeleaves[j][i] = 1 if i is in the set of leaves of the jth input tree*/
char incluster[maxedges][maxleaves];
/*cluster[i,j,k] is 1 if k is in the jth cluster of tree i*/
char cluster[maxedges][maxleaves];
float summary[6];
int summaryinputtriplefailure[6], summaryclustercount[6];
float minimum, epsilon;
float inputtripletotal;
int inputtriplefailure;
void finishup(void);
void finishup(void)
{
cout<<"Program has terminated."<>ch;
exit(0);
}
void adddistance(int first, int second, float dis);
void adddistance(int first, int second, float dis)
{
distancecount[first][second] = distancecount[first][second]+1;
if (distancecount[first][second] == 1)
distance[first][second] = dis;
else
distance[first][second] = (distance[first][second]*(distancecount[first][second]-1.0) + dis)
/ distancecount[first][second];
}
int digit(char );
int digit(char ch)
{
if (ch == '1') return 1;
else if (ch == '2') return 2;
else if (ch == '3') return 3;
else if (ch == '4') return 4;
else if (ch == '5') return 5;
else if (ch == '6') return 6;
else if (ch == '7') return 7;
else if (ch == '8') return 8;
else if (ch == '9') return 9;
else if (ch == '0') return 0;
else return -1;
}
float computedindistance(int, int);
float computedindistance(int first, int second)
//{this computes the distance from first to lca(first,second) in the current input tree}
{
int i;
float currentvalue;
currentvalue = 0.0;
// cout<<"Computing distance in tree "<>ch;
}
indistance>>actualleaves;
cout<<"The input number of taxa is "<>inputtype;
if (inputtype == 1) outcomment<<"Input via a list of lca distances of form: a b d(a,lub(a,b)) readings."<= maxidlength) done = true;
else
{
idlength = idlength + 1;
idstring[idlength]=ch;
}
}
if ((actualleaves > maxleaves) || (actualleaves < 0))
{
outcomment<<"Wrong number of species"<>k1;
counter[i1] = 0;
done = false;
taxonidnumber[i1] = k1;
reversetaxonidnumber[k1] = i1;
counter[i1] = 1;
// {read in the labels, at most labellength characters terminated by *}
while (! done)
{
indistance>>ch;
if (ch != '*')
{
labels[i1][counter[i1]] = ch;
counter[i1] = counter[i1] + 1;
if (counter[i1] >labellength)
{
done = true;
outcomment<<"Label length "< actuallabellength) actuallabellength = counter[i1];
}
} // {while not done do}
}
outcomment<<"Taxa:"<>ch;
for (i1 = 1; i1<=actualleaves;i1++)
for (j1 = 1; j1<=actualleaves;j1++)
{
distancecount[i1][j1] = 0;
distance[i1][j1] = 0.0;
}
if (inputtype == 1)
{ // {read in the distance matrix}
cout<<"Starting to read the distances."<>first>>second>>tempreal;
if (first == 0) done = true;
else
{
if (first > maxleaves)
{
outcomment<<"Illegal taxon "< maxleaves)
{
outcomment<<"Illegal taxon "<>ch;
*/
}
} // {while}
// echo distances
/* for(i=1; i<=actualleaves; i++)
for(j=1; j<=actualleaves; j++)
{
if (distancecount[i][j]>0) outcomment<*>tempint;
if (tempint == 0) done = true;
else if (tempint > 0)
if (tempint <= maxleaves)
if (reversetaxonidnumber[tempint]>0)
{
treeleaves[reversetaxonidnumber[tempint]]=1;
}
}
/* cout<<"Taxa in tree: ";
for (j = 1; j<=actualleaves;j++)
if ( treeleaves[treecount][j]==1) cout<>ch;
*/
/*
outcomment<<"Taxa in tree "<>ch;
*/
}
while (! treedone) // {read a new cluster until we get an empty cluster}
{
inclustercount = inclustercount+1;
for (j=1; j<=actualleaves;j++)
incluster[inclustercount][j]=0;
done = false;
while (! done)
{
temp1 = inclustercount;
indistance>>tempint;
if (tempint == 0) done = true;
else if (tempint > 0)
if (tempint <= maxleaves)
{
if (reversetaxonidnumber[tempint]>0)
incluster[ temp1][reversetaxonidnumber[tempint]]=1;
else
{
outcomment<<"Unidentified taxon "<>branchlength[temp1];
/* cout<<"Cluster "<>ch;
*/
// {check whether the cluster is empty}
treedone = true;
for (j = 1; j<=actualleaves;j++)
if (incluster[inclustercount][j] ==1)
treedone = false;
if (treedone) inclustercount = inclustercount-1;
} // {while not treedone}
/* cout<<" Tree completed. ";
cout<>ch;
*/
// {find the distances that can be inferred, and move them into the matrix distance}
if (! inputdone)
{
// cout<<"Inferred distances in tree # "<>ch;
*/
}
} // {if not inputdone}
if (! inputdone) adjointriples();
} // {while not inputdone}
outcomment<>allegedtreecount;
cout<<"Read "<>ch;
if (ch == '(')
{
cout<<'(';
treeclustercount = treeclustercount + 1;
clusterlevel = clusterlevel+1;
openclusters[clusterlevel]=treeclustercount;
numbermode = false;
}
else if (ch == ':')
{
if (numbermode)
{
cout<maxleaves)
{outcomment<<"Bad taxon "<>length;
branchlength[treeclustercount] = length;
cout<>length;
branchlength[openclusters[clusterlevel]] = length;
numbermode = false;
clusterlevel = clusterlevel-1;
cout<<':'<-1)
{
if (numbermode) taxon = taxon*10+digit(ch);
else
{
numbermode = true;
taxon = digit(ch);
}
}
} // {while not done}
cout<0)
outcomment<1) outcomment<<" 0 "<0)
if (distancecount[vertex[i]][vertex[k]] > 0)
{
tempreal = distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]];
if (tempreal > support[i][j])
if (tempreal > epsilon)
{
support[i][j] = tempreal;
support[j][i] = tempreal;
}
}
}
void findsupportprimarysum(int clusterid);
void findsupportprimarysum(int clusterid)
{
int i,j,k;
float tempreal;
for (i = 1; i<= actualleaves; i++)
for (j = 1; j<= actualleaves; j++)
support[i][j] = 0.0;
vertexcount = 0;
for (i = 1; i<= actualleaves; i++)
if (cluster[clusterid][i]==1)
{
vertexcount = vertexcount + 1;
vertex[vertexcount] = i;
}
for (i = 1; i<= vertexcount; i++)
for (j = i+1; j<= vertexcount; j++)
{
tempreal = 0.0;
for (k = 1; k<= vertexcount; k++)
if (i != k)
if (j != k)
{
if (distancecount[vertex[i]][vertex[j]]>0)
if (distancecount[vertex[i]][vertex[k]] > 0)
if (distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]] > 0.0)
tempreal = tempreal + distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]];
if (distancecount[vertex[j]][vertex[i]]>0)
if (distancecount[vertex[j]][vertex[k]] > 0)
if (distance[vertex[j]][vertex[k]] -distance[vertex[j]][vertex[i]] > 0.0)
tempreal = tempreal + distance[vertex[j]][vertex[k]] -distance[vertex[j]][vertex[i]];
}
if (tempreal > epsilon) support[i][j] = tempreal;
if (tempreal > epsilon) support[j][i] = tempreal;
}
}
void findsupportconfirmedsum(int clusterid);
void findsupportconfirmedsum(int clusterid)
{
int i,j,k;
float tempreal, tempreal2;
for (i = 1; i<= actualleaves; i++)
for (j = 1; j<= actualleaves; j++)
support[i][j] = 0.0;
vertexcount = 0;
for (i = 1; i<= actualleaves; i++)
if (cluster[clusterid][i]==1)
{
vertexcount = vertexcount + 1;
vertex[vertexcount] = i;
}
for (i = 1; i<= vertexcount; i++)
for (j = i+1; j<= vertexcount; j++)
{
for (k = 1; k<= vertexcount; k++)
if (i != k)
if (j != k)
if (distancecount[vertex[i]][vertex[j]]>0)
if (distancecount[vertex[i]][vertex[k]] > 0)
if (distancecount[vertex[j]][vertex[i]]>0)
if (distancecount[vertex[j]][vertex[k]] > 0)
{
tempreal = distance[vertex[j]][vertex[k]] -distance[vertex[j]][vertex[i]];
tempreal2 = distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]];
if (tempreal2 < tempreal) tempreal = tempreal2;
if (tempreal > 0.0)
{
support[i][j] = support[i][j]+tempreal;
support[j][i] = support[i][j];
}
}
if (support[i][j] < epsilon) support[i][j] = 0.0;
if (support[j][i] < epsilon) support[j][i] = 0.0;
}
}
void findsupportconfirmed(int clusterid);
void findsupportconfirmed(int clusterid)
{
int i,j,k;
float tempreal, tempreal2;
for (i = 1; i<= actualleaves; i++)
for (j = 1; j<= actualleaves; j++)
support[i][j] = 0.0;
vertexcount = 0;
for (i = 1; i<= actualleaves; i++)
if (cluster[clusterid][i]==1)
{
vertexcount = vertexcount + 1;
vertex[vertexcount] = i;
}
for (i = 1; i<= vertexcount; i++)
for (j = i+1; j<= vertexcount; j++)
{
for (k = 1; k<= vertexcount; k++)
if (i != k)
if (j != k)
if (distancecount[vertex[i]][vertex[j]]>0)
if (distancecount[vertex[i]][vertex[k]] > 0)
if (distancecount[vertex[j]][vertex[i]]>0)
if (distancecount[vertex[j]][vertex[k]] > 0)
{
tempreal = distance[vertex[j]][vertex[k]] -distance[vertex[j]][vertex[i]];
tempreal2 = distance[vertex[i]][vertex[k]] -distance[vertex[i]][vertex[j]];
if (tempreal2 < tempreal) tempreal = tempreal2;
if (tempreal > support[i][j])
if (tempreal > epsilon)
{
support[i][j] = tempreal;
support[j][i] = support[i][j];
}
}
}
}
void findahographsupported(float t);
void findahographsupported(float t)
{
int i,j;
minimumused = 3200000.0;
maximumused = 0.0;
for (i = 1; i<= vertexcount; i++)
for (j = 1; j<= vertexcount; j++)
vertexadjacent[i][j] = false;
for (i = 1; i<= vertexcount; i++)
for (j = i+1 ; j<= vertexcount; j++)
if (support[i][j] > t)
{
vertexadjacent[i][j] = true;
vertexadjacent[j][i] = true;
if (support[i][j] < minimumused) minimumused = support[i][j];
if (support[i][j] > maximumused) maximumused = support[i][j];
}
if (minimumused == 3200000.0) minimumused = 0.0;
}
void findclusterssupported(int clusterid);
void findclusterssupported(int clusterid)
{
//{This finds the new clusters which are the children of cluster[clusterid].
//The Aho graph is drawn with vertices the members of cluster[clusterid].
//The new clusters will be the components of the graph.
//the graph has vertexcount vertices;
// vertex[i] is the leaf in the ith vertex
// Vertices i and j are adjacent iff vertexadjacent[i,j] is true.
int taxoncount;
int i,j,k,l;
int componentsfound[maxleaves], vertexcomponent[maxleaves];//: splits;
float distanceup;
taxoncount = 0;
for (i= 1;i<=actualleaves;i++)
if (cluster[clusterid][i]==1) taxoncount = taxoncount + 1;
if (taxoncount > 1)
{
findsupportprimary(clusterid);
findahographsupported(0.0);
// if (edgetype == 1) findahographprimary(clusterid,0.0);
// else if (edgetype == 3) findahographconfirmed(clusterid,0.0);
for (i=1;i<=actualleaves; i++)
componentsfound[i]=0;
for (i = 1; i<=vertexcount;i++)
if( componentsfound[i]==0)
{
component(i);
for (k=1;k<=actualleaves;k++)
vertexcomponent[k] = componentarray[k];
// outcomment<<"Component of "<**0)
{
if (distance[vertex[j]][vertex[k]]0)
if (distance[vertex[j]][vertex[l]]1}
clusterdone[clusterid] = true;
}// {findclusterssupported}
void findclustersminimaldisconnectsupported(int clusterid);
void findclustersminimaldisconnectsupported(int clusterid)
{
/*{This finds the new clusters which are the children of cluster[clusterid].
The Aho graph is drawn with vertices the members of cluster[clusterid].
The new clusters will be the components of the graph.
the graph has vertexcount vertices;
vertex[i] is the leaf in the ith vertex
Vertices i and j are adjacent iff vertexadjacent[i,j] is true.
The minimal disconnecting threshold is found by a
bisection process: an interval containing the threshold
is [boundingmin, boundingmax].
When t = boundingmin, the cluster stays connected.
When t = boundingmax, the cluster disconnects.
}*/
int taxoncount;
int i,j,k,l;
int componentsfound[maxleaves], vertexcomponent[maxleaves];//: splits;
float distanceup;
float t;
boolean done, disconnected;
float boundingmin, boundingmax; // {interval containing t}
float previoust;
int boundingmincount;
float oldboundingmin;
taxoncount = 0;
for (i= 1;i<=actualleaves;i++)
if (cluster[clusterid][i]==1) taxoncount = taxoncount + 1;
if (taxoncount > 1)
{
t = 0.0;
boundingmin = 0.0;
boundingmax = 3200000.0;
previoust = -1000.0;
boundingmincount = 0;
if (edgetype == 1) findsupportprimary(clusterid);
else if (edgetype == 3) findsupportconfirmed(clusterid);
else if (edgetype == 5) findsupportprimarysum(clusterid);
else if (edgetype == 7) findsupportconfirmedsum(clusterid);
done = false;
while (! done)
{
//if (edgetype = 2 then findahographstrongedge(clusterid,t)
//else
findahographsupported(t);
component(1);
for (j=1; j<=vertexcount; j++)
vertexcomponent[j] = componentarray[j];
count = 0;
for (j = 1;j<=vertexcount;j++)
if ( vertexcomponent[j]==1) count = count + 1;
if (count < taxoncount)
{
boundingmax = t;
disconnected = true;
}
else
{
boundingmin = minimumused;
disconnected = false;
}
if (boundingmax == 3200000.0) boundingmax = maximumused;
if (boundingmin == oldboundingmin) boundingmincount = boundingmincount + 1;
else boundingmincount = 0;
oldboundingmin = boundingmin;
if (boundingmin >= boundingmax)
if (disconnected)
done = true;
if (! done)
{
if (boundingmincount >= queue) t = boundingmin;
else t = (boundingmin+boundingmax)/2.0;
if (boundingmin>boundingmax)
{
done = true;
}
if (t == previoust) done = true;
else previoust = t;
}
} //{while}
if (! disconnected)
{
t = boundingmax;
findahographsupported(t);
component(1);
for (j=1; j<=vertexcount; j++)
vertexcomponent[j] = componentarray[j];
count = 0;
for (j = 1;j<=vertexcount;j++)
if ( vertexcomponent[j]==1) count = count + 1;
if (count < taxoncount)
{
boundingmax = t;
disconnected = true;
}
else
{
boundingmin = minimumused;
disconnected = false;
}
if (! disconnected)
{
outcomment<<"Error in procedure: Cluster "<>ch;
}
}
// {adjoin proper components, now that the graph is disconnected}
// outcomment<<" For cluster "< tmaximal ) tmaximal = t;
for (i = 1;i<=vertexcount;i++)
componentsfound[i]=0;
for (i = 1;i<=vertexcount;i++)
if(componentsfound[i]==0)
{
component(i);
for(j=1;j<=vertexcount;j++)
vertexcomponent[j] = componentarray[j];
for(j=1;j<=vertexcount;j++)
if (vertexcomponent[j]==1) componentsfound[j]=1;
clustercount = clustercount + 1;
for(j=1;j<=actualleaves;j++)
cluster[clustercount][j]=0;
clusterdone[clustercount] = false;
for(j=1;j<=vertexcount;j++)
if (vertexcomponent[j]==1) cluster[clustercount][vertex[j]]=1;
// outcomment<<"cluster "<0)
{
if (distance[vertex[j]][vertex[k]]0)
if (distance[vertex[j]][vertex[l]]1}
clusterdone[clusterid] = true;
} // {findclustersminimaldisconnectsupported}
float clusterlubdis(int a, int b);
float clusterlubdis(int a, int b)
{
/*{this computes the distance between a and lub(a,b)
in cluster;
we find all clusters that contain a but not b}
*/
int i;
float currentdis;
currentdis = 0.0;
for (i = 1; i<=clustercount;i++)
if (cluster[i][a]==1)
if(cluster[i][b]==0)
currentdis = currentdis + clusterlength[i];
return(currentdis);
} // {clusterlubdis}
boolean abslashc(int a, int b,int c);
boolean abslashc(int a, int b,int c)
{//this function returns true if ab|c in the cluster}
int i;
boolean supportfound;
supportfound = false;
for (i = 1; i<=clustercount; i++)
if (! supportfound )
if(cluster[i][a]==1)
if ( cluster[i][b]==1)
if (cluster[i][c]==0)
supportfound = true;
return supportfound;
} // {abslashc}
void compareclusterwithinput(void);
void compareclusterwithinput(void)
{
int i,j,k;
// outcomment<<"\nComparison of computed distances with input distances.\n";
// outcomment<<" a b input d(a,lub(a,b)) computed d(a,lub(a,b))\n";
l2dev = 0.0;
count = 0;
for (i= 1;i<=actualleaves;i++)
for (j= 1;j<=actualleaves;j++)
if (i!= j)
if (distancecount[i][j] > 0 )
{
count = count + 1;
tempreal = clusterlubdis(i,j);
l2dev = l2dev +(distance[i][j]-tempreal)*(distance[i][j]-tempreal);
// if (actualleaves < 15) outcomment<= 2)
{
inputtriplefailure = 0;
inputtripletotal = 0;
for (i = 1; i<= actualleaves; i++)
for (j = i+1; j<=actualleaves; j++)
for (k = 1; k<=actualleaves; k++)
if (i!=k)
if (j!=k)
{
inputtripletotal = inputtripletotal + 1;
if (inputtriples[i][j][k])
if (! abslashc(i,j,k))
inputtriplefailure = inputtriplefailure + 1;
}
outcomment<<"Count of topological input triples not in the result is "< 1)
if (countpossible < actualleaves)
{
for (j= 1; j<=actualleaves;j++)
if (cluster[i][j]==1) outcodetreelist<0)
{
count = count + 1;
total = total + distance[i][j];
if (distance[i][j]< minimum) minimum = distance[i][j];
}
total = total/count;
epsilon = total/10000.0;
outcomment<<"Suggested epsilon is "<< epsilon< threshold OR l(b,c)-l(b,a) > threshold."< threshold AND l(b,d)-l(b,a) > threshold.');}
outcomment<<"2. Accumulated primary. There is an edge (ab) if the sum of the positive "< threshold AND l(b,c)-l(b,a) > threshold."<= 2)
{
outcomment<<"Input triplets"<= 2)
{
outcomment<=2)
{
besttriplet = 2;
besttripletvalue = summaryinputtriplefailure[2];
if (summaryinputtriplefailure[3]*