RIMD, Osaka Univ.
MAFFT version 7

Multiple alignment program for amino acid or nucleotide sequences

Rapid calculation of full-length MSA of closely related viral genomes  Experimental (2020/Apr/11)

When the input data set is large and the sequences are very closely related (∼99% identity), it is sometimes useful to build a full MSA by aligning every sequence just to a reference.  The time complexity is O(N L log L), where N is the number of sequences and L is the sequence length.

Online version supports more than 20,000 sequences × ∼30,000 sites, 2021/Jan. 

Online version supports up to ∼ 20,000 sequences × ∼30,000 sites, 2020/Apr/18. 

Reduced the frequency of "timeout" error in data transfer, 2020/Jul/27.

On the command line, use version 7.467 or later.  Earlier versions (≤7.458) had the same options but were inefficient for this purpose.
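The version requirement can be checked in a script before starting a large job. The following is a minimal sketch, not part of MAFFT: `version_ge` is a hypothetical helper that compares two dotted version strings field by field, and the version string of the installed program has to be obtained separately.

```shell
# version_ge HAVE NEED: succeed if dotted version HAVE is at least NEED.
# Hypothetical helper. Fields are compared numerically from left to right,
# so 7.458 < 7.467 < 7.473. A leading "v" on HAVE is ignored.
version_ge() {
  awk -v have="$1" -v need="$2" 'BEGIN {
    sub(/^v/, "", have)
    n = split(have, h, ".")
    m = split(need, r, ".")
    max = (n > m) ? n : m
    for (i = 1; i <= max; i++) {
      if (h[i] + 0 > r[i] + 0) exit 0   # HAVE is newer
      if (h[i] + 0 < r[i] + 0) exit 1   # HAVE is older
    }
    exit 0                              # identical versions
  }'
}
```

For example, `version_ge 7.458 7.467 || echo "please update"` prints the message, because 7.458 is older than 7.467.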

Online

Input page  Updated 2022/Mar

Procedure:

  1. Select a single reference sequence or a reference MSA (a small set of sequences already aligned).
  2. Enter the reference (e.g., NC_045512.2) in the "Reference" box.
  3. Enter the other sequences in the "New sequence(s)" box.
  4. Select options (Adjust direction, Keep alignment length, etc.) as necessary.
  5. Submit.


Command Updated 2021/Jan/29

After dividing the input sequences into the reference and the others, type:
% mafft --6merpair --addfragments othersequences referencesequence > output
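If all sequences are in a single FASTA file, the division into reference and others could be done as sketched below. This is only an illustration: `split_by_id` and the file names are hypothetical, and the reference is assumed to be identified by the first word of its header line.

```shell
# split_by_id INPUT REF_ID REF_OUT OTHERS_OUT
# Hypothetical helper: writes the record whose header starts with
# ">REF_ID" (first word) to REF_OUT and all other records to OTHERS_OUT.
split_by_id() {
  : > "$3"; : > "$4"                      # create or empty both outputs
  awk -v id="$2" -v ref="$3" -v oth="$4" '
    /^>/      { out = (substr($1, 2) == id) ? ref : oth }
    out != "" { print >> out }            # lines before the first header are skipped
  ' "$1"
}
```

For example, `split_by_id all.fa NC_045512.2 referencesequence othersequences` would produce the two files used in the command above.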

To keep the numbering of sites (the output then has the same length as the reference), add --keeplength, which is sometimes faster than the default:

% mafft --6merpair --keeplength --addfragments othersequences referencesequence > output
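After a run with --keeplength, it can be worth confirming that every row of the output has one common length. The sketch below assumes a FASTA-format output; `fasta_lengths` and `same_length` are hypothetical helpers, not MAFFT commands.

```shell
# fasta_lengths FILE: print "name<TAB>length" for each record, counting
# every character on the sequence lines (gaps included).
fasta_lengths() {
  awk '
    /^>/ { if (have) print name "\t" len
           have = 1; name = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (have) print name "\t" len }
  ' "$1"
}

# same_length FILE: succeed if all records share a single length.
same_length() {
  fasta_lengths "$1" | awk -F'\t' '!seen[$2]++ { n++ } END { exit (n > 1) }'
}
```

For example, `same_length output || echo "rows differ in length"` reports a problem only when at least two rows differ in length.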

If the input data contains many ambiguous letters, try:

% mafft --6merpair --maxambiguous 0.05 --addfragments othersequences referencesequence > output
This removes sequences in which more than 5% of the letters are ambiguous.  Available in versions ≥7.473.
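To preview which sequences exceed such a threshold before running the alignment, the fraction of ambiguous letters can be estimated per sequence. The sketch below is an approximation for nucleotide data under a stated assumption: any letter other than A, C, G, T or U counts as ambiguous and gaps are ignored. The criterion applied internally by --maxambiguous may differ, and `ambiguous_fraction` is a hypothetical helper.

```shell
# ambiguous_fraction FILE: print "name<TAB>fraction" for each record.
# Assumption: fraction = (letters other than A, C, G, T, U) / (all letters),
# case-insensitive, with "-" characters left out of both counts.
ambiguous_fraction() {
  awk '
    function report() {
      if (have) printf "%s\t%.4f\n", name, (total ? amb / total : 0)
    }
    /^>/ { report(); have = 1; name = substr($1, 2); total = amb = 0; next }
    {
      s = toupper($0)
      gsub(/-/, "", s)                  # drop gaps
      total += length(s)
      amb += gsub(/[^ACGTU]/, "", s)    # gsub returns the number of matches
    }
    END { report() }
  ' "$1"
}
```

For example, `ambiguous_fraction othersequences | awk -F'\t' '$2 > 0.05'` lists the sequences whose estimated fraction is above 5%.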

With the --thread option, the calculation runs efficiently in parallel:

% mafft --6merpair --thread -1 --keeplength --addfragments othersequences referencesequence > output

Without the --6merpair flag, the calculation is too slow for this purpose.

A similar option, --add, is not efficient for this purpose, but is suitable when the input sequences are less closely related, the sequences to be added are fewer, and a reference MSA is available.