cbrc
MAFFT version 7

Multiple alignment program for amino acid or nucleotide sequences

Algorithms and parameters (unfinished)

MAFFT offers various multiple alignment strategies. They are classified into three types, (a) the progressive method, (b) the iterative refinement method with the WSP score, and (c) the iterative refinment method using both the WSP and consistency scores. In general, there is a tradeoff between speed and accuracy. The order of speed is a > b > c, whereas the order of accuracy is a < b < c. The results of benchmarks can be seen here. The following are the detailed procedures for the major options of MAFFT.

(a) FFT-NS-1, FFT-NS-2 — Progressive methods

prog.png
These are simple progressive methods like ClustalW. By using the several new techniques described below, these options can align a large number of sequences (up to ∼5,000) on a standard desktop computer. The qualities of the resulting alignments are shown here. The detailed algorithms are described in Katoh et al. (2002). The following techniques are used to improve the performance.

FFT approximation.  (Not yet written) See Katoh et al. (2002).

k-mer counting.  To accelerate the initial calculation of the distance matrix, which requires a CPU time of O(N2) steps, a rough method similar to the 'quicktree' option of ClustalW is adopted, in which the number of k-mers shared by a pair of sequences is counted and regarded as an approximation of the degree of similarity. MAFFT uses the very rapid method proposed by Jones et al. (1992) with a minor modification (Katoh et al. 2002): (1) The 20 amino acids are compressed to 6 alphabets, according to Dayhoff et al. (1978), and (2) MAFFT performs the second progressive alignment (FFT-NS-2) in order to improve the accuracy.

Modified UPGMA.  A modified version of UPGMA is used to construct a guide tree, which works well for handling fragment sequences.

The second progressive alignment.  The accuracy of the second progressive alignment (FFT-NS-2) is slightly higher than that of the first progressive alignment (FFT-NS-1) according to the BAliBASE test, but the amount CPU time required by FFT-NS-2 is approximately two times longer than that by FFT-NS-1.

(b) FFT-NS-i, NW-NS-i — Iterative refinement method

iter.png
The accuracy of progressive alignment can be improved by the iterative refinement method (Berger and Munson 1991, Gotoh 1993). A simplified version of PRRN is implemented as the FFT-NS-i option of MAFFT. In FFT-NS-i, an initial alignment by FFT-NS-2 is subjected to an iterative refienment process.

Objective function.  The weighted sum-of-pairs (WSP) score proposed by Gotoh is used.

Tree-dependent partitioning.  (Not yet written) See Hirosawa et al.

Effect of FFT.  To test the effect of the FFT approximation, we also implemented the NW-NS-x options, in which the FFT approximation is disabled, but the other procedures are the same as those in the corresponding FFT-NS-x. There was no significant reduction in the accuracy by introducing the FFT approximation (Katoh et al. 2002).

(c) L-INS-i, E-INS-i, G-INS-i — Iterative refinement methods using WSP and consistency scores

cons.png
In order to obtain more accurate alignments in extremely difficult cases, three new options, L-INS-i, G-INS-i and E-INS-i, have been added to recent versions (v.≥5) of MAFFT. These options use a new objective function combining the WSP score (Gotoh) explained above and the COFFEE-like score (Notredame et al.), which evaluates the consistency between a multiple alignment and pairwise alignments (Katoh et al. 2005).

For pairwise alignment, three different types of algorithms are implemented, global alignment (Needleman-Wunsch), local alignment (Smith-Waterman) with affine gap costs (Gotoh) and local alignment with generalized affine gap costs (Altschul). The differences in the accuracy values among these methods are small for the currently available benchmarks, as shown here. However, each of them has different characteristics, according to the algorithm in the pairwise alignment stage:

Consistency score.  The COFFEE objective function was originally proposed by Notredame et al. (1998), and the extended versions are used in TCoffee and ProbCons. MAFFT also adopts a similar objective function, as described in Katoh et al. (2005). However, the consistency among three sequences (called 'library extension' in TCoffee) is currently not calculated in MAFFT, because the improvement in accuracy by library extension was limited to alignments consisting of a small number (<10) of sequences in our preliminary tests. If library extention is needed, then please use TCoffee or ProbCons.

Consistency + WSP.  Instead, the WSP score is summed with the consistency score in the objective function of MAFFT. The use of the WSP score has the merit that a pattern of gaps can be incorporated into the objective function. This is probably the reason why MAFFT achieves higher accuracy than ProbCons and TCoffee for alignments consisting of many (∼10 - ∼100) sequences. This suggests that the pattern of gaps within a group to be aligned is important information when aligning two groups of proteins (and evaluating homology between distantly related protein families).

Parameters

Scoring matrix for amino acid alignment.  The BLOSUM62 matrix is adopted as a default scoring matrix, because this showed slightly higher accuracy values than the BLOSUM80, 45, JTT200PAM, 100PAM and Gonnet matrices in SABmark tests.

Scoring matrix for nucleotide alignment.  The default scoring matrix is derived from Kimura's two-parameter model. The ratio of transitions to transversions is set at 2 by default. Other parameters can be used, but have not yet been tested.

Gap penalties for proteins.  The default gap penalties for amino acid alignments have been changed in v.4.0. Note that the current version of MAFFT returns an entirely different alignment from v.<4.0. In v.4.0, two major gap penalties (--op [gap open penalty] and --ep [offset value, which functions like a gap extension penalty, see the mafft3 paper for definition]) were tuned by applying the FFT-NS-2 option to a part of the SABmark benchmark. We adopted the parameter set (--op 1.53 --ep 0.123) optimized for SABmark, because this works better for other benchmark (HOMSTRAD, PREFAB and BAliBASE) tests than the previous one (--ep 2.4 --ep 0.06). Other parameters might work better in other situations. Consistency-based options have more parameters (L-INS-i has four more parameters and E-INS-i has six more parameters, as explained here). We determined these additional parameters so that the Smith-Waterman alignment function used in L-INS-i returns a local alignment similar to that generated by FASTA, but we have not closely tuned them yet. In our tests using SABmark, the accuracy values can be improved by 2-3% by tuning these parameters, but this improvement may result from overfitting.

Gap penalties for RNAs.  The default gap penalties for nucleotide alignment have changed in v.5.6. Note that the current version of MAFFT returns an entirely different alignment from v.<5.6. In the former versions (v.<5.6), the default gap penalties for nucleotide alignments were set at the same values as those for amino acid alignments. According to BRAliBASE, these penalties result in very bad alignments for RNAs. The newer versions (v.≥5.6) use a different penalties for nucleotide alignment; the penalty values are set to three times larger than those for amino acids. This is not yet the optimal value for BRAliBASE. The BRAliBASE score can be improved by closely tuning the penalty values, but we have not adopted the optimized penalties, because we are not sure whether they are applicable to a wide range of problems.

Average score or log expectation score.  A different parameter set from from that described above is used in MUSCLE, which has an algorithm similar to that of NW-NS-i. MUSCLE improved in the accuracy of multiple sequence alignment by introducing better parameters than those of the previous version (v3.89) of MAFFT (shown in gray letters in these tables). The latest version of MAFFT uses the re-adjusted gap penalties (see above) with a conventional average score. As Wallace et al. (2005) reported that the log expectation score outperforms the conventional average score, we are planning to examine the effectiveness of the log expectation score proposed in MUSCLE.

Mafft-homologs

The accuracy of an alignment of a few distantly related sequences is considerably improved when they are aligned together with their close homologs. The reason for the improvement is probably the same as that for PSI-BLAST. That is, the positions of highly conserved residues, those with many gaps and other additional information are provided by close homologs. According to Katoh et al. (2005), the improvement by adding close homologs is 10% or so, which is comparable to the improvement achieved by incorporating the structural information of a pair of sequences. Mafft-homologs in the mafft server works like this:

  1. Collect a number (50 by default) of close homologs (E=1e-10 by default) of the input sequences.
  2. Align the input sequences and homologs together using the L-INS-i strategy.
  3. Remove the homologs.

mafft-homologs

A service based on a similar idea is also available in PRALINE. Note that mafft-homologs aims only to improve the alignment accuracy, but not to comprehensively collect the full-length sequences of homologs. Thus, the resulting alignment with homologs by mafft-homologs is not appropriate for evolutionary analyses. For such a purpose, services with complete sequences, such as PipeAlign, are more appropriate.

Mapping a nucleotide base to a complex number

When the FFT approximation is used in nucleotide alignment, MAFFT v.5.743 and previous versions convert a nucleotide base to a set of four real numbers. For example, the sequence AACGTCT is converted to a set of four arrays: (1,1,0,0,0,0,0), (0,0,1,0,0,1,0), (0,0,0,1,0,0,0) and (0,0,0,0,1,0,1). In v.5.744, this will be replaced with a more efficient method, according to Rockwood et al. (2005). That is, AACGTCT is converted to (i,i,-1,1,-i,1,-i), where i is an imaginary unit.

Correction formula for the 6mer distance

Some options (FFT-NS-2, PartTree, etc.) use the number of 6-tuples shared between every sequence pair as the basis of the guide tree.  Mafft3-5 computes the distance Dij between sequences i and j as

Dij = 1 - Sij / min( Sii, Sjj )

where Sij is the number of shared 6-tuples by sequences i and j.  The expected value of Dij between two unrelated sequences depends on the lengths of the sequences; when comparing a very short sequence and a very long sequence, Dij is near 0 by chance, even if the sequences are unrelated to each other.  The situation is depicted in the figure below, where the red points indicate Dij calculated for randomly generated aa sequences (by ROSE) with various pairs of lengths.  The cyan surface is

f(x,y) = y/x * 0.1 + 10000 / ( x + 10000 ) + 0.01

where x and y are the lengths of longer and shorter sequences, respectively.  The function f(,) was empirically determined so that the surface is close to the red points. Mafft6 uses Dij' instead of Dij,

Dij' = Dij / f(x,y).

lenfac.png

References

(Not yet finished)