% mafft in > out
With the --retree 1 option, the speed roughly doubles and the accuracy is reduced:
% mafft --retree 1 in > out
in which the re-estimation of the guide tree is omitted.
A combination of the iterative refinement method and the progressive method is possible using a Ruby script, mafft-sparsecore.rb, included in the MAFFT package.
% mafft-sparsecore.rb -i in > out
This is more accurate than the default. Also see Yamada et al. (2016) and try the "mafft-sparsecore" option in the online version. A detailed explanation of this script is being prepared.
The G-INS-1 option is applicable to large data, when huge RAM and a large number of CPU cores are available.
% mafft --globalpair --thread n in > out
where n is the number of threads to run in parallel. The G-INS-1 option is significantly more accurate than the above options in benchmark tests (Yamada et al. 2016). However, the computational cost of this option is impractically high for ∼100,000 sequences.
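As a small sketch of how n might be chosen, the wrapper below asks the operating system for the number of online cores and passes that value to --thread. The function name `ginsi_all_cores` is made up for this example. `getconf _NPROCESSORS_ONLN` is available on Linux and macOS, and the fallback to a single thread is an assumption for systems where it is not.

```shell
# Hypothetical wrapper: run G-INS-1 with one thread per online CPU core.
# Usage: ginsi_all_cores in > out
ginsi_all_cores() {
    input=$1
    # Ask the OS for the core count; fall back to a single thread if unknown.
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    mafft --globalpair --thread "$n" "$input"
}
```

On a shared machine it may be more considerate to pass a smaller, fixed n instead of claiming every core.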
With the new --large flag, the G-INS-1 option can be applied to large data without huge RAM.
% mafft --large --globalpair --thread n in > out
This option uses files, instead of RAM, to store temporary data. The default location of temporary files is $HOME/maffttmp/ (Linux, Mac and Cygwin) or %TMP% (Windows). The location can be changed by setting the MAFFT_TMPDIR environment variable.
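The sketch below shows one way to point the temporary files at a scratch disk. It assumes a POSIX shell; the function name `ginsi_large` and the example path are hypothetical, and only MAFFT_TMPDIR and the mafft flags come from the text above.

```shell
# Hypothetical wrapper: run G-INS-1 with --large, keeping temporary files
# in a caller-chosen directory instead of the default location.
# Usage: ginsi_large /scratch/mafft 8 in > out   (the path is an example)
ginsi_large() {
    tmpdir=$1
    threads=$2
    input=$3
    # Create the directory if it does not exist yet.
    mkdir -p "$tmpdir" || return 1
    # Set MAFFT_TMPDIR in the environment of this mafft invocation.
    MAFFT_TMPDIR=$tmpdir mafft --large --globalpair --thread "$threads" "$input"
}
```

Choosing a directory on a fast local disk with plenty of free space is likely to matter here, since the temporary data that would otherwise sit in RAM is written there.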
To run the G-INS-1 option faster on a cluster system, try the MPI version.
% mafft --memsavetree in > out
which is applicable to larger data, but slower and slightly less accurate than the default in our benchmark. For this option, too, the input sequences have to be highly similar to each other. Also try the online version.
The --memsavetree option can be combined with the faster option.
% mafft --memsavetree --retree 1 in > out
To use the --memsavetree option in the mafft-sparsecore.rb script,
% mafft-sparsecore.rb -A '--memsavetree' -i in > out
A detailed explanation of this script is being prepared.
Methods with chained guide trees are also available.
% mafft --pileup in > out
aligns the sequences just in the input order, and
% mafft --randomchain in > out
randomizes the order. Their accuracy is controversial. See Boyce et al. (2014) and Yamada et al. (2016) for details.
The following options (Katoh & Toh 2007) are still available:
% mafft --parttree --retree 1 in > out
% mafft --parttree --retree 2 in > out
% mafft --parttree --retree 2 --partsize 1000 in > out
% mafft --dpparttree --retree 2 --partsize 1000 in > out
MAFFT requires memory space proportional to L² by default, where L is sequence length. When the --memsave option is added or alignment length exceeds a threshold, a linear-space DP algorithm similar to Myers & Miller (1988) is used. It is approximately two times slower than a normal DP. If you have huge RAM space, add --nomemsave to always apply a normal DP (versions ≥6.620 only).
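To get a feel for the quadratic growth, the rough calculation below multiplies the square of the length by a per-cell size. The function name and the bytes-per-cell argument are assumptions made for illustration. The text only states that memory is proportional to the square of L, so the result is an order-of-magnitude guide, not MAFFT's actual footprint.

```shell
# Hypothetical helper: size in MiB of an L x L table at a given bytes per cell.
# Usage: dp_table_mib LENGTH BYTES_PER_CELL
dp_table_mib() {
    len=$1
    bytes_per_cell=$2
    # Integer arithmetic; the result is rounded down to whole MiB.
    echo $(( len * len * bytes_per_cell / 1048576 ))
}
```

For instance, `dp_table_mib 10000 4` prints 381, and a tenfold longer input needs roughly a hundred times more, which is where the linear-space algorithm becomes attractive.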
When the similarity among input sequences is high and the number of sequences N is small (up to ∼100), the FFT approximation is recommended to reduce the CPU time of the DP process. This is enabled automatically.
% mafft --(no)memsave in > out
The --retree 1 option increases the speed (×∼2) and reduces the accuracy for long alignments, too.
% mafft --(no)memsave --retree 1 in > out
Iterative refinement can be applied to improve the accuracy only when the similarity is high.
% mafft --(no)memsave --maxiterate 2 in > out
Note that MAFFT assumes that all input sequences share the order of homologous sites or blocks. If the sequences contain repeats or inversions, use other tools such as FASTA and MUMmer.
Options to handle large data:
Argument | Description | Default |
---|---|---|
--retree 1 | Approximately two times faster but rougher than default. | --retree 2 |
--maxiterate 2 | Enhances the accuracy but not directly applicable to a large number of sequences. | --maxiterate 0 |
--memsave | Memory saving but approximately two times slower. | auto |
--fft | For long (∼1,000,000 nt) conserved sequences. | auto |
--nofft | For a large number of sequences (∼5,000). | auto |
--parttree | For extremely large numbers of sequences (>100,000). | disabled |
--dpparttree | For extremely large numbers of sequences (>100,000). | disabled |
--fastaparttree | For extremely large numbers of sequences (>100,000). | disabled |
--partsize 1000 | More accurate than default. Valid with the --*parttree options. | --partsize 50 |
--groupsize 1 | Does not align. Recommended to be used with --reorder. Valid with the --*parttree options. The sequences will be sorted according to similarity. | --groupsize (large) |
--memsavetree | Uses less memory space to build a guide tree. | disabled |
--pileup | Aligns the sequences in the input order. | disabled |
--randomchain | Aligns the sequences in a randomized order. | disabled |
--randomseed | Seed of random numbers. | 0 |
--treeout | Outputs guide tree into the in.tree file in the current directory. | disabled |
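As a rough illustration of how the table might be applied, the helper below counts the records in a FASTA file and prints a size-related flag. The function name is hypothetical, and the cut-offs of 5,000 and 100,000 sequences are read loosely from the table rather than being thresholds that MAFFT itself enforces.

```shell
# Hypothetical helper: count the records in a FASTA file and print the
# size-related flag from the table that matches that count.
# Usage: suggest_large_flags in
suggest_large_flags() {
    # Each FASTA record starts with a header line beginning with '>'.
    nseq=$(grep -c '^>' "$1")
    if [ "$nseq" -gt 100000 ]; then
        echo "--parttree"
    elif [ "$nseq" -ge 5000 ]; then
        echo "--nofft"
    else
        # Small input: print an empty line so the defaults apply.
        echo ""
    fi
}
```

The output can then be spliced into a command line, for example `mafft $(suggest_large_flags in) in > out`. Since --fft and --nofft default to auto, the middle branch mostly makes the choice explicit.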