MAFFT version 6

Multiple alignment program for amino acid or nucleotide sequences

Tips for handling a large dataset

If the number of sequences is < 10,000, MAFFT version ≥5.850 automatically selects a moderately fast method that can process a large dataset. 
% mafft in > out
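
To see which side of the 10,000-sequence mark an input file falls on, the records can be counted before running MAFFT. The following is a minimal sketch, assuming the input is a FASTA file with one '>' header line per sequence; the function name count_seqs is made up for this illustration and is not part of MAFFT.

```shell
# count_seqs FILE
# Prints the number of FASTA records, i.e. lines that start with '>'.
# (Hypothetical helper, not shipped with MAFFT.)
count_seqs() {
    # grep exits with status 1 when the count is 0, so its status is ignored.
    grep -c '^>' "$1" || true
}
```

After defining the function in a POSIX shell, running count_seqs on the input file prints a single integer that can be compared with 10,000.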

If this terminates abnormally, or if you have extremely many (>10,000) sequences to align, try manually selecting an appropriate combination of the following options.

Argument         Description                                                    Default
--retree 1       Approximately two times faster but rougher than the default    --retree 2
--maxiterate 2   Enhances the accuracy but is not applicable to many sequences  --maxiterate 0
--memsave        Saves memory but is approximately two times slower             auto
--fft            For long (∼1,000,000 nt) conserved sequences                   auto
--nofft          For many (∼5,000) sequences                                    auto
--parttree       For extremely many (>10,000) sequences                         disabled
--dpparttree     For extremely many (>10,000) sequences                         disabled
--fastaparttree  For extremely many (>10,000) sequences                         disabled
--partsize 1000  More accurate than the default                                 --partsize 50
--groupsize 1    Does not align; the sequences are sorted by similarity.        --groupsize (large)
                 Recommended to be used with --reorder.
--treeout        Outputs the guide tree                                         disabled
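
The option table can be read as a rough decision rule for cases where the default run fails. The sketch below is one hedged way to write that rule down. The function name suggest_mafft_options and every numeric cut-off in it are assumptions chosen from the approximate figures on this page; they are not thresholds that MAFFT itself uses, and only flags listed in the table are emitted.

```shell
# suggest_mafft_options NSEQ MAXLEN
# Prints an option string for a dataset with NSEQ sequences whose longest
# sequence has MAXLEN residues. Hypothetical helper; all cut-offs below are
# assumptions for illustration, not values taken from MAFFT.
suggest_mafft_options() {
    nseq=$1
    maxlen=$2
    if [ "$nseq" -gt 10000 ]; then
        # Extremely many sequences: PartTree-based guide tree.
        echo "--parttree --retree 1"
    elif [ "$nseq" -le 100 ] && [ "$maxlen" -ge 100000 ]; then
        # Few, very long, conserved sequences: FFT plus linear-space DP.
        echo "--fft --memsave"
    elif [ "$maxlen" -ge 10000 ]; then
        # Many long sequences: linear-space DP, no guide tree re-estimation.
        echo "--memsave --retree 1"
    else
        # Many short sequences: only skip the guide tree re-estimation.
        echo "--retree 1"
    fi
}
```

The printed string is meant to be pasted between mafft and the input file name. For example, suggest_mafft_options 50000 1000 prints "--parttree --retree 1".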

Bug information: Version 5.830 (2006/04/24) crashes when a long (>32,767) gap is being inserted.  Please update to v5.850 or higher.

Few (∼20 sequences) × Very long (∼1,000,000 nt)

Bug information: Versions 6.619 - 6.704 have a problem with this feature.  Please update to v6.705 or higher (2009/05/17).

MAFFT requires memory space proportional to L² by default, where L is the sequence length.  When the --memsave option is added or the alignment length exceeds a threshold, however, a linear-space DP algorithm similar to that of Myers & Miller (1988) is used.  It has not yet been tested whether this algorithm reduces the accuracy of the resulting alignment.  Moreover, it is approximately two times slower than a normal DP.  If you have a large amount of RAM, add --nomemsave to always apply a normal DP (versions ≥6.620 only).
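
To get a feeling for why a quadratic DP matrix is impractical for very long sequences, the memory requirement can be estimated with shell arithmetic. This is only a back-of-the-envelope sketch: the function name dp_matrix_mib is hypothetical, and the default of 4 bytes per matrix cell is an assumption for illustration, not MAFFT's actual cell size.

```shell
# dp_matrix_mib LEN [BYTES_PER_CELL]
# Prints the size in MiB (rounded down) of a LEN x LEN DP matrix.
# BYTES_PER_CELL defaults to 4, which is an assumed value.
dp_matrix_mib() {
    len=$1
    bytes=${2:-4}
    # 1048576 bytes = 1 MiB; needs 64-bit shell arithmetic for large LEN.
    echo $(( $len * $len * $bytes / 1048576 ))
}
```

Under these assumptions, dp_matrix_mib 10000 prints 381, while dp_matrix_mib 1000000 prints 3814697 (several terabytes), which is the regime where the linear-space algorithm becomes necessary.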

When the similarity among input sequences is high and the number of sequences N is small (up to ∼100), the FFT approximation is highly recommended to reduce the CPU time of the DP process from O(L²) to O(L).

% mafft --fft --(no)memsave in > out
Time complexity: O(NL)+O(N³) (when input sequences are highly conserved) to O(NL²)+O(N³) (when the similarity among input sequences is weak)
Space complexity: O(NL)+O(N²)
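
The gap between the two ends of the time complexity range can be made concrete by plugging in numbers. The sketch below simply evaluates the two expressions with all constant factors ignored, so the results are relative operation counts, not run times. The function name dp_cost is hypothetical.

```shell
# dp_cost N L
# Prints rough operation counts, ignoring constant factors:
#   best  = N*L   + N^3  (highly conserved input)
#   worst = N*L^2 + N^3  (weakly similar input)
dp_cost() {
    n=$1
    l=$2
    echo "best=$(( $n * $l + $n * $n * $n )) worst=$(( $n * $l * $l + $n * $n * $n ))"
}
```

For 20 sequences of 1,000,000 nt, dp_cost 20 1000000 prints "best=20008000 worst=20000000008000", a difference of about six orders of magnitude, which is why the FFT approximation matters most for long, conserved sequences.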

The re-estimation of the guide tree can be disabled with --retree 1, which approximately doubles the speed at the cost of some accuracy compared with the default.

% mafft --fft --(no)memsave --retree 1 in > out

Iterative refinement can be applied to improve the accuracy only when the similarity is high.

% mafft --fft --(no)memsave --maxiterate 2 in > out

Note that MAFFT is applicable only to globally homologous input sequences.  If the sequences contain repeats or inversions, use other tools such as FASTA and MUMmer.

Many (∼5,000 sequences) × Short (∼1,000 aa or nt, incl. gaps)

When the number of sequences is large, the FFT approximation requires a large amount of memory and increases the CPU time rather than reducing it, so FFT is automatically turned off at the later stages of progressive alignment.  You can manually disable FFT by adding the --nofft option.

The re-estimation of the guide tree can be disabled with --retree 1, which approximately doubles the speed at the cost of some accuracy compared with the default.

% mafft --retree 1 in > out
Time complexity: O(NL²)+O(N³)
Space complexity: O(NL)+O(N²)+O(L²)

A key technique for handling many sequences is the 3-mer- or 6-mer-based algorithm for roughly estimating pairwise distances (Higgins & Sharp 1988; Jones et al. 1992; Katoh et al. 2002).  Another program package, MUSCLE (Edgar 2004), adopted the same algorithm.  MUSCLE is worth trying because it has a more efficient UPGMA routine than that of MAFFT.
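
The idea behind a k-mer-based distance is that two sequences can be compared by counting the short words they share, without aligning them. The sketch below illustrates that idea only. It is a simplified stand-in, not the formula used by MAFFT or MUSCLE: the function name kmer_distance and the definition 1 - shared / min(number of k-mers) are assumptions made for this example.

```shell
# kmer_distance SEQ1 SEQ2 [K]
# Prints 1 - shared/min(n1, n2) with three decimals, where n1 and n2 are the
# numbers of k-mers in each sequence and "shared" counts k-mers present in
# both (with multiplicity). K defaults to 6. Illustrative only.
kmer_distance() {
    awk -v a="$1" -v b="$2" -v k="${3:-6}" 'BEGIN {
        na = length(a) - k + 1
        nb = length(b) - k + 1
        # Sequences shorter than k share no k-mers by definition.
        if (na < 1 || nb < 1) { print "1.000"; exit }
        # Count every k-mer of the first sequence.
        for (i = 1; i <= na; i++) count[substr(a, i, k)]++
        # Match k-mers of the second sequence against the remaining counts.
        shared = 0
        for (i = 1; i <= nb; i++) {
            w = substr(b, i, k)
            if (count[w] > 0) { count[w]--; shared++ }
        }
        m = (na < nb) ? na : nb
        printf "%.3f\n", 1 - shared / m
    }'
}
```

Identical sequences give 0.000 and sequences with no common k-mer give 1.000. Because each comparison is a single pass over both sequences, a full distance matrix for thousands of sequences stays far cheaper than aligning every pair.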

Many (∼5,000 sequences) × Long (∼10,000 aa or nt, incl. gaps)

When a large number of diverged sequences are involved, the alignment length sometimes becomes large because many gaps are needed.  In such a case, the memory-saving algorithm is automatically applied.  To always apply a normal DP, add --nomemsave (versions ≥6.620 only), provided you have a large amount of RAM.  MAFFT tends to generate longer alignments (with more and longer gaps) than other tools such as CLUSTAL W.
% mafft --(no)memsave --retree 1 in > out
Time complexity: O(NL²)+O(N³)
Space complexity: O(NL)+O(N²)
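
To check how much the alignment has grown relative to the input sequences, the number of columns in the output can be measured directly. The following is a minimal sketch, assuming the output is an aligned FASTA file in which every record has the same length; the function name alignment_length is hypothetical.

```shell
# alignment_length FILE
# Prints the number of columns (residues plus gap characters) in the first
# record of an aligned FASTA file.
alignment_length() {
    awk '/^>/ { if (seen) exit; seen = 1; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { print len + 0 }' "$1"
}
```

Comparing this number with the length of the longest input sequence gives a quick indication of how many gap columns were introduced.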

Extremely many (∼50,000 sequences) × Short (∼1,000 aa or nt, incl. gaps)

See the PartTree paper.
% mafft --parttree --retree 1 in > out
% mafft --parttree --retree 2 in > out
% mafft --parttree --retree 2 --partsize 1000 in > out
% mafft --fastaparttree --retree 2 --partsize 1000 in > out
The above options have been tested on only a small number of examples.  Please send a bug report to the author if you have any trouble using these options.