Modified UPGMA
A modified version of UPGMA is used to construct a guide tree.
At a step that clusters sequences A and B in UPGMA,
d(C,AB) = 0.5 * ( d(A,C) + d(B,C) )
but we use a variant of UPGMA like this:
d(C,AB) = x * 0.5 * ( d(A,C) + d(B,C) ) + (1-x) * min( d(A,C), d(B,C) )
The x value is arbitrarily set to 0.1. This works well in handling fragment sequences from
the following reason.
For simplicity, let us consider distance d() calculated from
an alignment consisting of two alphabets (denoted by - and =) and gaps (denoted by spaces)
like this:
A. -------------------------=======-----
B. =======
C. -------=------------
D. ==-=------=--=--==--------==-=-=-----
where B is a fragment sequences of A (%id=100),
C is a fragment closely related to A, and
D is distantly related to A-C.
An appropriate clustering order is
(((A,B),C),D).
In the initial distance matrix,
d(A,B) is 0, d(A,C) is a small value,
and d(B,C) is a large value because B is not homologous to C.
A and B are clustered in the first step, as d(A,B)=0.
In the UPG method, the distance between C and cluster AB is
d(C,AB) = 0.5*( d(A,C) + d(B,C) ).
d(C,AB) is a large value because of d(B,C).
It may be larger than d(D,AB) and d(C,D) in the worst case.
As a result, (AB,D) or (C,D) may be clustered in the second step
( ((A,B),D),C) or ((A,B),(C,D)) ).
To avoid these,
d(C,AB) is set to min( d(A,C), d(B,C) ) = d(A,C).
The above method works wrong on a situation like
A. -----=======-------------------------
B. =======
C. --------------------
D. --------------------
where d(A,B)=d(A,C)=d(A,D)=d(C,D)=0 and d(B,C)≠0.
As a guide tree,
((A,B),(C,D)) (1)
is desirable,
but depending on the input order,
(((A,B),C),D) (2)
might be selected under the above method,
because d(C,AB)=d(D,AB)=d(C,D)=0 after the first clustering.
To avoid (2), we use a UPG style distance in part.