MAFFT version 7

Multiple alignment program for amino acid or nucleotide sequences

Tips for handling a large dataset

>∼50,000 sequences × ∼50,000 sites; % identity >∼95

A typical case is a set of genomes from different strains of a virus species.  See here. (2020/Apr)

∼50,000 – ∼100,000 sequences × ∼5,000 sites incl. gaps

In versions ≥7.299, the default option is applicable to this size of data if the sequences are highly similar to each other (2016/Jul). 
% mafft in > out

The --retree 1 option increases the speed (about two times) and reduces the accuracy:

% mafft --retree 1 in > out
in which the re-estimation of the guide tree is omitted.

A combination of the iterative refinement method and the progressive method is possible using a Ruby script, mafft-sparsecore.rb, included in the MAFFT package.

% mafft-sparsecore.rb -i in > out
This is more accurate than the default.  Also see Yamada et al. (2016) and try the "mafft-sparsecore" option in the online version.  A detailed explanation of this script is being prepared.

The G-INS-1 option is applicable to large data, when huge RAM and a large number of CPU cores are available.

% mafft --globalpair --thread n in > out
where n is the number of threads to run in parallel.  The G-INS-1 option is significantly more accurate than the above options in benchmark tests (Yamada et al. 2016).  However, the computational cost of this option is impractically high for ∼100,000 sequences.
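One way to choose n is to ask the operating system how many cores are online.  The lines below are only a sketch: they assume a POSIX shell in which getconf _NPROCESSORS_ONLN is available (falling back to 1 otherwise), and the guard around the mafft call is added here so that the snippet does nothing when mafft or the input file is missing.

```shell
# Number of online CPU cores; fall back to 1 if getconf cannot report it.
n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "threads: $n"

# Run only when mafft is installed and an input file named "in" exists.
if command -v mafft >/dev/null 2>&1 && [ -f in ]; then
    mafft --globalpair --thread "$n" in > out
fi
```

Using every core is not always the best choice on a shared machine; a smaller value of n leaves room for other jobs.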

With a new flag, --large, the G-INS-1 option has become applicable to large data without using huge RAM.

% mafft --large --globalpair --thread n in > out
This option uses files, instead of RAM, to store temporary data.  The default location of temporary files is $HOME/maffttmp/ (Linux, Mac and Cygwin) or %TMP% (Windows).  The location can be changed by setting the MAFFT_TMPDIR environment variable.
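As an illustration of redirecting the temporary files, the sketch below sets MAFFT_TMPDIR before calling mafft.  The directory name maffttmp_example and the thread count 4 are hypothetical values chosen for this example; in practice the directory should sit on a disk with plenty of free space.

```shell
# Hypothetical scratch location; replace with a disk that has enough free space.
MAFFT_TMPDIR="${TMPDIR:-/tmp}/maffttmp_example"
export MAFFT_TMPDIR
mkdir -p "$MAFFT_TMPDIR"

# Run only when mafft is installed and an input file named "in" exists.
if command -v mafft >/dev/null 2>&1 && [ -f in ]; then
    # The thread count 4 is an arbitrary example value.
    mafft --large --globalpair --thread 4 in > out
fi
```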

To run the G-INS-1 option faster on a cluster system, try the MPI version.

∼100,000 – ∼200,000 sequences × ∼5,000 sites incl. gaps (2016/Jul)

Versions ≥7.299 have an experimental option, --memsavetree,
% mafft --memsavetree in > out
which is applicable to larger data, but slower and slightly less accurate than the default in our benchmark.  For this option, too, the input sequences have to be highly similar to each other.  Also try the online version.
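The size ranges on this page can be turned into a rough rule of thumb.  The sketch below counts the sequences in a FASTA file and adds --memsavetree above ∼100,000 sequences; the helper names count_seqs and pick_options are hypothetical, and the cutoff is only read off the approximate ranges given here, so treat it as a guide rather than a rule.

```shell
# Count FASTA records: every record starts with a ">" header line.
count_seqs() {
    grep -c '^>' "$1"
}

# Print the extra option suggested by the sequence count.
# The 100,000 cutoff is taken from the approximate ranges on this page.
pick_options() {
    if [ "$1" -gt 100000 ]; then
        echo "--memsavetree"
    fi
}

# Run only when mafft is installed and an input file named "in" exists.
if command -v mafft >/dev/null 2>&1 && [ -f in ]; then
    # $(...) is left unquoted on purpose so that an empty result adds no argument.
    mafft $(pick_options "$(count_seqs in)") in > out
fi
```

This only looks at the number of sequences; it does not check that the sequences are highly similar to each other, which both the default and --memsavetree require at this scale.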

The --memsavetree option can be combined with the faster --retree 1 option.

% mafft --memsavetree --retree 1 in > out

To use the --memsavetree option in the mafft-sparsecore.rb script,

% mafft-sparsecore.rb -A '--memsavetree' -i in > out
A detailed explanation of this script is being prepared.

Methods with chained guide trees are also available.

% mafft --pileup in > out
aligns the sequences in the input order, and
% mafft --randomchain in > out
randomizes the order.  Their accuracy is controversial.  See Boyce et al. (2014) and Yamada et al. (2016) for details. 

The following options (Katoh & Toh 2007) are still available:

% mafft --parttree --retree 1 in > out
% mafft --parttree --retree 2 in > out
% mafft --parttree --retree 2 --partsize 1000 in > out
% mafft --dpparttree --retree 2 --partsize 1000 in > out

∼20 sequences × ∼1,000,000 bases

Bug information: Versions 6.619 – 6.704 have a problem with this feature.  Please update to v6.705 or higher (2009/05/17).

MAFFT requires memory space proportional to L² by default, where L is the sequence length.  When the --memsave option is added or the alignment length exceeds a threshold, a linear-space DP algorithm similar to that of Myers & Miller (1988) is used.  It is approximately two times slower than a normal DP.  If you have a large amount of RAM, add --nomemsave to always apply a normal DP (versions ≥6.620 only).
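To get a feel for the difference between quadratic and linear memory use at this sequence length, the arithmetic can be sketched in the shell.  The figure of 4 bytes per DP cell is an assumption made purely for illustration and is not MAFFT's actual cell size; only the ratio between the two results matters.  The calculation needs 64-bit shell arithmetic.

```shell
L=1000000            # sequence length in bases
BYTES_PER_CELL=4     # assumption for illustration only, not MAFFT's actual cell size

# Full DP matrix: L x L cells.
quadratic_cells=$(( L * L ))
quadratic_gb=$(( quadratic_cells * BYTES_PER_CELL / 1000000000 ))

# Linear-space DP keeps on the order of one row: L cells.
linear_mb=$(( L * BYTES_PER_CELL / 1000000 ))

echo "full DP matrix: ${quadratic_cells} cells, about ${quadratic_gb} GB"
echo "one DP row:     ${L} cells, about ${linear_mb} MB"
```

Under this assumption the full matrix would need terabytes, while a single row fits in a few megabytes, which is why a linear-space algorithm becomes necessary for sequences of this length.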

When the similarity among input sequences is high and the number of sequences N is small (up to ∼100), the FFT approximation is recommended to reduce the CPU time of the DP process.  This is enabled automatically.

% mafft --(no)memsave in > out
where --(no)memsave stands for either --memsave or --nomemsave.

The --retree 1 option increases the speed (×∼2) and reduces the accuracy for long alignments, too.

% mafft --(no)memsave --retree 1 in > out

Iterative refinement can be applied to improve the accuracy, but only when the similarity is high.

% mafft --(no)memsave --maxiterate 2 in > out

Note that MAFFT assumes that all input sequences share the same order of homologous sites or blocks.  If the sequences contain repeats or inversions, use other tools such as FASTA and MUMmer.

Options to handle large data:

Argument | Description | Default
--retree 1 | Approximately two times faster but rougher than the default. | --retree 2
--maxiterate 2 | Enhances the accuracy but is not directly applicable to a large number of sequences. | --maxiterate 0
--memsave | Saves memory but is approximately two times slower. | auto
--fft | For long (∼1,000,000 nt) conserved sequences. | auto
--nofft | For a large number of sequences (∼5,000). | auto
--parttree | For extremely large numbers of sequences (>100,000). | disabled
--dpparttree | For extremely large numbers of sequences (>100,000). | disabled
--fastaparttree | For extremely large numbers of sequences (>100,000). | disabled
--partsize 1000 | More accurate than the default.  Valid with the --*parttree options. | --partsize 50
--groupsize 1 | Does not align.  Recommended to be used with --reorder.  Valid with the --*parttree options.  The sequences will be sorted according to similarity. | --groupsize (large)
--memsavetree | Uses less memory space to build a guide tree.  (New) | disabled
--pileup | Aligns the sequences in the input order.  (New) | disabled
--randomchain | Aligns the sequences in a randomized order.  (New) | disabled
--randomseed | Seed of random numbers.  (New) | 0
--treeout | Outputs the guide tree into the in.tree file in the current directory. | disabled

The above options are not yet completely tested.  Please send a bug report to the author if you have any trouble using these options.