MAFFT version 7

Multiple alignment program for amino acid or nucleotide sequences

Rapid calculation of full-length MSA of closely related viral genomes (Experimental, 2020/Apr/11)

When the input data set is large and the sequences are very closely related (∼99% identity), it is sometimes useful to build a full MSA by aligning every sequence only to a reference.  Time complexity is O(N L log L), where N is the number of sequences and L is the sequence length.
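As a rough illustration of this scaling, take N = 20,000 and L = 30,000. The base of the logarithm and the constant factor are not stated here, so base 2 is an assumption and the result indicates relative scale only, not a run time:

```latex
N L \log_2 L \approx (2 \times 10^{4}) \times (3 \times 10^{4}) \times 14.9 \approx 8.9 \times 10^{9}
```

Because the expression is linear in N, doubling the number of sequences roughly doubles the cost.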

As of 2020/Apr/18, the online version supports up to ∼20,000 sequences × ∼30,000 sites.  The frequency of "timeout" errors in data transfer was reduced on 2020/Jul/27.

On the command line, use version 7.467 or later.  Earlier versions (≤7.458) had the same options but were inefficient for this purpose.
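A minimal sketch of a version check for scripts follows. The helper name version_ge is hypothetical, and the commented usage line assumes that mafft --version prints a string beginning with "v7.xxx" (possibly on standard error); verify this on your installation.

```shell
# Hypothetical helper: succeed (exit 0) if version $1 >= version $2.
# Compares "major.minor" numerically, which suits three-digit minors
# such as 7.458 and 7.467.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        if (x[1] + 0 != y[1] + 0) exit !(x[1] + 0 > y[1] + 0)
        exit !(x[2] + 0 >= y[2] + 0)
    }'
}

# Usage (assumption about the output format of --version):
#   have=$(mafft --version 2>&1 | sed -n 's/^v\([0-9.]*\).*/\1/p')
#   version_ge "$have" 7.467 || echo "MAFFT $have is older than 7.467" >&2
```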

Online

Input page (addfragments)

Procedure:

  1. Select a single reference sequence or a reference MSA (a small set of sequences already aligned).
  2. Input the reference to the "Existing alignment" box.
  3. Input the other sequences to the "Fragmentary sequence(s)" box.
  4. Select options (Adjust direction, Keep alignment length, etc) as necessary.
  5. Submit the job.

Note that:

Command (Updated 2020/Apr/18)

After dividing the input sequences into reference and others, type:
% mafft --auto --addfragments othersequences referencesequence > output
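The dividing step can be sketched as follows, assuming all sequences are in one FASTA file whose first record is the chosen reference. The function name split_reference and the input name all.fasta are hypothetical; the output names match the command above.

```shell
# Hypothetical helper: write the first FASTA record of $1 to
# ./referencesequence and all remaining records to ./othersequences.
split_reference() {
    # Start from empty files so stale output from an earlier run is not kept.
    : > referencesequence
    : > othersequences
    awk '/^>/ { n++ }
         n == 1 { print > "referencesequence" }
         n >  1 { print > "othersequences" }' "$1"
}

# Usage:
#   split_reference all.fasta
```

If the reference is not the first record, move it to the top of the file first, or select it by name instead.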

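To keep the site numbering of the reference, add --keeplength.  The alignment length is then unchanged, and insertions in the added sequences are deleted.  This is sometimes faster than the default.

% mafft --auto --keeplength --addfragments othersequences referencesequence > output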
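One way to confirm that the site numbering was kept is to compare row lengths in the reference and in the output. This is a sketch only; the function name alignment_lengths is hypothetical, and it assumes plain FASTA input.

```shell
# Hypothetical helper: print the distinct row lengths (residues plus gaps)
# found in an aligned FASTA file. A valid alignment yields exactly one number.
alignment_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (seen) print len }' "$1" | sort -n | uniq
}

# Usage: with --keeplength, both commands are expected to print the same
# single number.
#   alignment_lengths referencesequence
#   alignment_lengths output
```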

The calculation runs efficiently in parallel; --thread -1 lets MAFFT choose the number of threads automatically:

% mafft --auto --thread -1 --keeplength --addfragments othersequences referencesequence > output

Without the --auto flag, the calculation is too slow for this purpose.

A similar option, --add, is not efficient for this purpose, but it is suitable when the input sequences are less closely related, the sequences to be added are fewer, and a reference MSA is available.