MAFFT ver.7 - a multiple sequence alignment program

Tips for handling a large dataset

For a large set of short sequences, try version 7.299 or later, which performs better than earlier versions. An online service for this type of problem is also available.
For long sequences, see below.

>∼50,000 sequences × ∼50,000 sites; % identity >∼95

A typical case is a set of genomes from different strains of a virus species. See here. (2020/Apr)

∼50,000 – ∼100,000 sequences × ∼5,000 sites incl. gaps

In versions ≥7.299, the default option is applicable to this size of data if the sequences are highly similar to each other (2016/Jul).

% mafft in > out
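To see which of the size classes on this page an input falls into, it helps to count the sequences and find the longest one first. The following is a minimal sketch in plain shell and awk. The function name fasta_stats is hypothetical (it is not part of MAFFT), and `in` is the same placeholder input file name used in the commands on this page.

```shell
# fasta_stats: hypothetical helper, not part of the MAFFT package.
# Prints the number of sequences and the length of the longest one
# (gap characters, if any, are counted as part of the length).
fasta_stats() {
    # $1: path to a FASTA file
    awk '
        /^>/ { if (len > max) max = len; len = 0; n++; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END {
            if (len > max) max = len
            printf "%d sequences, longest %d residues\n", n, max
        }
    ' "$1"
}

# Usage: report on the input file if it exists.
if [ -f in ]; then
    fasta_stats in
fi
```

The counts are only a rough guide, since the size ranges on this page are approximate.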

The --retree 1 option roughly doubles the speed but reduces the accuracy:

% mafft --retree 1 in > out

This option omits the re-estimation of the guide tree.

A combination of the iterative refinement method and the progressive method is possible using a Ruby script, mafft-sparsecore.rb, included in the MAFFT package.

% mafft-sparsecore.rb -i in > out

This is more accurate than the default. Also see Yamada et al. (2016) and try the "mafft-sparsecore" option in the online version. A detailed explanation of this script is being prepared.

The G-INS-1 option is applicable to large data, when huge RAM and a large number of CPU cores are available.

% mafft --globalpair --thread n in > out

where n is the number of threads to run in parallel. The G-INS-1 option is significantly more accurate than the above options in benchmark tests (Yamada et al. 2016). However, the computational cost of this option is impractically high for ∼100,000 sequences.
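One way to choose n is to ask the operating system how many CPU cores are online. This is only a sketch: getconf _NPROCESSORS_ONLN is a system facility on Linux and macOS, not a MAFFT feature, and using every core is an assumption that may not suit a shared machine.

```shell
# Number of online CPU cores; fall back to 1 if getconf cannot report it.
n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "using $n threads"

# Run G-INS-1 with that many threads, if mafft and the input file are present.
if command -v mafft >/dev/null 2>&1 && [ -f in ]; then
    mafft --globalpair --thread "$n" in > out
fi
```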

With the new --large flag, the G-INS-1 option can be applied to large data without huge RAM.

% mafft --large --globalpair --thread n in > out

This option uses files, instead of RAM, to store temporary data. The default location of temporary files is $HOME/maffttmp/ (Linux, Mac and Cygwin) or %TMP% (Windows). The location can be changed by setting the MAFFT_TMPDIR environment variable.
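For example, the temporary files can be redirected to a disk with more free space before starting the run. The directory below is a hypothetical choice for illustration, and the thread count of 4 is an arbitrary example value.

```shell
# Hypothetical location: replace with a directory on a disk with enough free space.
export MAFFT_TMPDIR="${TMPDIR:-/tmp}/maffttmp"
mkdir -p "$MAFFT_TMPDIR"

# Check how much space is available there before a long run.
df -h "$MAFFT_TMPDIR"

# Run only if mafft and the input file are present.
if command -v mafft >/dev/null 2>&1 && [ -f in ]; then
    mafft --large --globalpair --thread 4 in > out
fi
```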

To run the G-INS-1 option faster on a cluster system, try the MPI version.

∼100,000 – ∼200,000 sequences × ∼5,000 sites incl. gaps (2016/Jul)

Versions ≥7.299 have an experimental option, --memsavetree,

% mafft --memsavetree in > out

which is applicable to larger data, but slower and slightly less accurate than the default in our benchmark. For this option, too, the input sequences have to be highly similar to each other. Also try the online version.
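A wrapper script could switch to --memsavetree only when the input is large enough to need it. The sketch below assumes a cutoff of 100,000 sequences, taken from the approximate size ranges on this page. The function name choose_tree_opts and the variable MEMSAVETREE_MIN are hypothetical and not part of MAFFT.

```shell
# Assumed cutoff, based on the approximate ranges on this page; adjust as needed.
MEMSAVETREE_MIN=100000

# choose_tree_opts: hypothetical helper.
# $1: number of sequences; prints extra mafft options (an empty line if none).
choose_tree_opts() {
    if [ "$1" -gt "$MEMSAVETREE_MIN" ]; then
        printf '%s\n' '--memsavetree'
    else
        printf '\n'
    fi
}

if [ -f in ] && command -v mafft >/dev/null 2>&1; then
    nseq=$(grep -c '^>' in)          # count FASTA header lines
    opts=$(choose_tree_opts "$nseq")
    # $opts is intentionally unquoted so that an empty value adds no argument.
    mafft $opts in > out
fi
```

As noted above, the input sequences still have to be highly similar to each other in either case.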

The --memsavetree option can be combined with the faster option.

% mafft --memsavetree --retree 1 in > out

To use the --memsavetree option in the mafft-sparsecore.rb script,

% mafft-sparsecore.rb -A '--memsavetree' -i in > out

A detailed explanation of this script is being prepared.

Methods with chained guide trees are also available.

% mafft --pileup in > out

aligns the sequences just in the input order, and

% mafft --randomchain in > out

randomizes the order. Their accuracy is controversial. See Boyce et al. (2014) and Yamada et al. (2016) for details.

The following options (Katoh & Toh 2007) are still available:

% mafft --parttree --retree 1 in > out

% mafft --parttree --retree 2 in > out

% mafft --parttree --retree 2 --partsize 1000 in > out

% mafft --dpparttree --retree 2 --partsize 1000 in > out

∼20 sequences × ∼1,000,000 bases

Bug information: Versions 6.619 – 6.704 have a problem with this feature. Please update to v6.705 or higher (2009/05/17).

MAFFT requires memory space proportional to L² by default, where L is sequence length. When the --memsave option is added or alignment length exceeds a threshold, a linear-space DP algorithm similar to Myers & Miller (1988) is used. It is approximately two times slower than a normal DP. If you have huge RAM space, add --nomemsave to always apply a normal DP (versions ≥6.620 only).
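To get a feel for why quadratic memory is a problem at this length, the number of DP cells can be worked out with shell arithmetic. The 4 bytes per cell used below is an assumption made purely for illustration; the actual per-cell cost inside MAFFT is not stated here.

```shell
L=1000000            # sequence length (about 1,000,000 bases)
bytes_per_cell=4     # ASSUMPTION for illustration only, not a MAFFT constant

cells=$((L * L))                                         # grows as L squared
gib=$((cells * bytes_per_cell / 1024 / 1024 / 1024))     # whole GiB, rounded down

echo "L=$L: $cells DP cells, about $gib GiB at $bytes_per_cell bytes per cell"
```

A linear-space algorithm avoids holding the full matrix, which is why it remains usable at this length despite being slower.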

When the similarity among input sequences is high and the number of sequences N is small (up to ∼100), the FFT approximation is recommended to reduce the CPU time of the DP process. This is enabled automatically.

% mafft --(no)memsave in > out

The --retree 1 option increases the speed (×∼2) and reduces the accuracy for long alignments, too.

% mafft --(no)memsave --retree 1 in > out

Iterative refinement can be applied to improve the accuracy only when the similarity is high.

% mafft --(no)memsave --maxiterate 2 in > out

Note that MAFFT assumes that all input sequences share the same order of homologous sites or blocks. If the sequences contain repeats or inversions, use other tools such as FASTA and MUMmer.

Options to handle large data:

Argument	Description	Default
`--retree 1`	Approximately two times faster but rougher than the default.	`--retree 2`
`--maxiterate 2`	Enhances the accuracy but not directly applicable to a large number of sequences.	`--maxiterate 0`
`--memsave`	Memory saving but approximately two times slower.	auto
`--fft`	For long (∼1,000,000 nt) conserved sequences.	auto
`--nofft`	For a large number of sequences (∼5,000).	auto
`--parttree`	For extremely large numbers of sequences (>100,000).	disabled
`--dpparttree`	For extremely large numbers of sequences (>100,000).	disabled
`--fastaparttree`	For extremely large numbers of sequences (>100,000).	disabled
`--partsize 1000`	More accurate than default. Valid with the `--*parttree` options.	`--partsize 50`
`--groupsize 1`	Does not align. Recommended to be used with `--reorder`. Valid with the `--*parttree` options. The sequences will be sorted according to similarity.	`--groupsize (large)`
`--memsavetree`	Uses less memory space to build a guide tree.	disabled
`--pileup`	Aligns the sequences in the input order.	disabled
`--randomchain`	Aligns the sequences in a randomized order.	disabled
`--randomseed`	Seed of random numbers.	0
`--treeout`	Outputs guide tree into the in.tree file in the current directory.	disabled

The above options have not yet been completely tested. Please send a bug report to the author if you have any trouble using these options.
