RIMD, Osaka Univ.
MAFFT version 7

Multiple alignment program for amino acid or nucleotide sequences

This feature is supported in versions ≥7.120.

Non-biological sequences

Non-biological sequences, or texts consisting of printable characters, can be aligned in the --text mode. 

Input:

> text 1
2008~KATO~Toh
> text 2
2005~Katoh~Kuma~MIYATA~Toh

The simplest command:

% mafft --text input > output

Other options are also available in this mode.

% mafft-ginsi --text input > output
% mafft --text --clustalout input > output
etc.

BUG!! In versions <7.369, the combination with --globalpair or --localpair sometimes failed.

BUG!! In versions 7.395 – 7.409, the combination with --clustalout did not work.  This bug will be fixed soon (2019/Jan).

Output of --text:

2008~K-------------ATO~Toh
2005~Katoh~Kuma~MIYATA~Toh
*** **             ** ****
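The marker line under the alignment can be reproduced from the two aligned rows. The following is a minimal sketch in Python, not MAFFT's own code; identity_line is a hypothetical helper name, and the rule it applies (an asterisk where all rows share the same non-gap character, case-sensitively) is an assumption read off the example above.

```python
def identity_line(rows):
    """Return a marker line with '*' where every row has the same non-gap character."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("aligned rows must have equal length")
    marks = []
    for column in zip(*rows):
        first = column[0]
        same = first != "-" and all(ch == first for ch in column)
        marks.append("*" if same else " ")
    return "".join(marks)


# The two aligned rows from the example output (13 gap characters in the first row).
rows = [
    "2008~K" + "-" * 13 + "ATO~Toh",
    "2005~Katoh~Kuma~MIYATA~Toh",
]
print(rows[0])
print(rows[1])
print(identity_line(rows))
```

Note that the comparison is case-sensitive, which is why "ATO" and "ATA" match only in their first two letters and "Katoh" is not matched against "KATO".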

Extended alphabet

Versions 7.270 and higher accept extended characters such as ö and ä.

------Northern_part_of----_Cha_das_Caldeiras,_near_Fernao_Gomes.
------Northern_part_of_the_Cha_das_Caldeiras-_near_Fernão_Gomes-
----------------------------------------------Near_Fernão_Gomes-
Fógo:_northern_part_of_the_Cha_das_Caldeiras._------------------

The text has to be 8-bit encoded, for example in LATIN1, Windows-1252 or Mac OS Roman.  UTF-8 can be converted to and from LATIN1 with the iconv program on Linux, if the text uses Western European alphabets only.

% iconv -f UTF-8 -t LATIN1 input.utf8 > input.latin1
% mafft --text input.latin1 > output.latin1
% iconv -f LATIN1 -t UTF-8 output.latin1 > output.utf8
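Before running iconv, it can help to check whether a text fits into LATIN1 at all. The following is a minimal sketch in Python, independent of MAFFT; latin1_problems is a hypothetical helper name, and the sample strings are illustrative only.

```python
def latin1_problems(text):
    """Return the characters in text that cannot be encoded as LATIN1."""
    bad = []
    for ch in text:
        try:
            ch.encode("latin-1")
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# "\u00e3" is a-tilde, which LATIN1 can represent as a single byte.
sample = "near_Fern\u00e3o_Gomes"
print(latin1_problems(sample))

# "\u0141" (L with stroke) is outside LATIN1, so it is reported.
print(latin1_problems("\u0141\u00f3d\u017a"))
```

If the returned list is empty, each character maps to exactly one byte and the conversion shown above should be lossless.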

248 alphabets (in alpha testing, 2018/Feb)

The --text mode actually accepts characters 0x01 – 0xFF excluding > (0x3E), = (0x3D), < (0x3C), - (0x2D), Space (0x20), Carriage Return (0x0d) and Line Feed (0x0a).  That is 255 codes minus 7 excluded ones, so the maximum alphabet size is 248.
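The accepted range can be written down as a set of byte values and used to screen data before alignment. The following is a minimal sketch in Python based only on the list above; ALLOWED, EXCLUDED and disallowed_bytes are hypothetical names, not part of MAFFT.

```python
# Codes excluded from --text input: > = < - Space CR LF
EXCLUDED = {0x3E, 0x3D, 0x3C, 0x2D, 0x20, 0x0D, 0x0A}

# Everything from 0x01 to 0xFF that is not excluded (0x00 is outside the range).
ALLOWED = [b for b in range(0x01, 0x100) if b not in EXCLUDED]


def disallowed_bytes(data):
    """Return the sorted distinct byte values in data that fall outside ALLOWED."""
    return sorted(set(data) - set(ALLOWED))


print(len(ALLOWED))
print(disallowed_bytes(b"2008~KATO~Toh"))
print(disallowed_bytes(b"a-b c"))
```

The sequence body is what should be screened this way; title lines begin with > and are not part of the alphabet.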

Two format converters, hex2maffttext and maffttext2hex, are bundled in versions ≥7.390 to make 248 alphabets easier to handle.

Usage:

(1) Prepare an input file, input.hex, in hexadecimal code (in the range explained above), using a space as the separator.  The title of each sequence should be marked by >.

>sequence1
01 02 03 4e 6f 72 74 68 65 72 6e 5f 70 61 ...
>sequence2
01 02 03 4e 6f 72 74 68 65 72 6e 5f 70 61 ...
>sequence3
a3 6f 5f 47 6f 6d 65 73 ...
>sequence4
01 02 03 46 c3 b3 67 6f 3a 5f 6e 6f 72 74 68 65 72 6e 5f 70 61 ...
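A file in this layout can be produced from raw byte strings. The following is a minimal sketch in Python, assuming two-digit lowercase hex codes as in the example above; to_hex_record is a hypothetical helper, and the sample data is only the visible beginning of sequence1.

```python
def to_hex_record(title, data):
    """Format one record: a '>' title line, then space-separated two-digit hex codes."""
    codes = " ".join(format(b, "02x") for b in data)
    return ">" + title + "\n" + codes + "\n"


# Three control codes followed by printable text, as in the start of sequence1.
data = bytes([0x01, 0x02, 0x03]) + b"Northern_pa"
record = to_hex_record("sequence1", data)
print(record, end="")
```

Several records can be concatenated and written to input.hex for step (2).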

(2) Convert this file to ASCII code (including printable characters and control characters):

% /usr/local/libexec/mafft/hex2maffttext input.hex > input.ASCII

On Windows PowerShell, which uses UTF-16 by default, the conversion to ASCII takes two steps (2022/Aug):
PS C:\somewhere> usr\lib\mafft\hex2maffttext input.hex > input.utf16
PS C:\somewhere> Get-Content input.utf16 | Set-Content -Encoding ASCII input.ASCII

(3) Run mafft --text

% mafft --text --clustalout input.ASCII > output.ASCII

(4) The output can be converted back to hexadecimal notation by:

% /usr/local/libexec/mafft/maffttext2hex output.ASCII > output.hex

Result:

CLUSTAL format alignment by MAFFT (v7.390)

sequence1       01 02 03 -- -- -- -- -- -- -- 4e 6f 72 74 68 65 72 6e 5f 70 61  ...
sequence2       01 02 03 -- -- -- -- -- -- -- 4e 6f 72 74 68 65 72 6e 5f 70 61  ...
sequence3       -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --  ...
sequence4       01 02 03 46 c3 b3 67 6f 3a 5f 6e 6f 72 74 68 65 72 6e 5f 70 61  ...
                                                                            

sequence1       61 6f 5f 47 6f 6d 65 73 2e 
sequence2       a3 6f 5f 47 6f 6d 65 73 -- 
sequence3       a3 6f 5f 47 6f 6d 65 73 -- 
sequence4       -- -- -- -- -- -- -- -- -- 
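Aligned rows in hexadecimal notation can be turned back into readable text for inspection. The following is a minimal sketch in Python, assuming that gaps appear as -- tokens as in the result above and that the codes are meant to be read as LATIN1; decode_hex_row is a hypothetical helper, not a MAFFT tool.

```python
def decode_hex_row(row, gap="-"):
    """Turn space-separated hex tokens into text; '--' tokens become gap characters."""
    chars = []
    for token in row.split():
        if token == "--":
            chars.append(gap)
        else:
            # Each token is one byte; LATIN1 maps every byte to one character.
            chars.append(bytes([int(token, 16)]).decode("latin-1"))
    return "".join(chars)


# The last block of sequence1 and sequence4 from the result above.
print(decode_hex_row("61 6f 5f 47 6f 6d 65 73 2e"))
print(decode_hex_row("-- -- -- -- -- -- -- -- --"))
```

Only the hex columns should be passed in, without the sequence name; an elided row ending in ... would need the ellipsis removed first.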

User-defined scoring matrix for 248 alphabets (in alpha testing)

A non-default scoring matrix can be used in versions 7.371 or higher.

% mafft --textmatrix matrixfile input > output

The format of matrixfile is:

0x01 0x01 2   # (comment)
0x1e 0x1e 2
0x1f 0x1f 2
0x21 0x21 2   # ! × !
0x41 0x41 2   # A × A
0x42 0x42 2   # B × B
0x43 0x43 2   # C × C
0x44 0x44 2   # D × D
0x30 0x30 2   # 0 × 0
0x31 0x31 2   # 1 × 1
0x32 0x32 2   # 2 × 2
0x33 0x33 2   # 3 × 3
0x34 0x34 2   # 4 × 4

0x41 0x30 0.5 # A × 0
0x30 0x41 0.5 # 0 × A (Unnecessary in versions ≥ 7.400)

0x42 0x31 0.5 # B × 1
0x31 0x42 0.5 # 1 × B (Unnecessary in versions ≥ 7.400)

0x46 0x35 0.5 # F × 5
0x35 0x46 0.5 # 5 × F (Unnecessary in versions ≥ 7.400)

It is not necessary to give all 248×248 pairs.  If a score for a pair is given in this file, it overrides the default score for that pair.  If a pair does not appear in this file, the default score is used for it.  Text after '#' is ignored.

In versions < 7.400, a mismatch score between letters p and q (p ≠ q) had to be specified twice (i.e., p×q and q×p).  In versions ≥ 7.400, this is unnecessary; if a score for p×q is set, then the same score is used for q×p, too.
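A matrix file in this format can be generated from a table of scores. The following is a minimal sketch in Python, assuming single-byte characters and the "0xPP 0xQQ score" layout shown above; matrix_lines and its symmetric flag are hypothetical names. With symmetric=True it writes both p×q and q×p, which is what versions < 7.400 require.

```python
def matrix_lines(scores, symmetric=True):
    """Format {(p, q): score} as 'p q score' lines with 0x-prefixed hex codes."""
    lines = []
    seen = set()
    for (p, q), score in scores.items():
        pairs = [(p, q)]
        if symmetric and p != q:
            pairs.append((q, p))  # mirrored entry for versions < 7.400
        for a, b in pairs:
            if ord(a) > 0xFF or ord(b) > 0xFF:
                raise ValueError("characters must fit in one byte")
            if (a, b) in seen:
                continue
            seen.add((a, b))
            lines.append("0x%02x 0x%02x %g" % (ord(a), ord(b), score))
    return lines


scores = {("A", "A"): 2, ("B", "1"): 0.5}
for line in matrix_lines(scores):
    print(line)
```

Passing symmetric=False gives the shorter file that is sufficient for versions ≥ 7.400.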

Difference between --text and --anysymbol

The --anysymbol option also accepts input data with non-alphabetical symbols.  This option is for amino acid or nucleotide sequences that contain unusual symbols, such as U and i.  Input sequences are interpreted as amino acid or nucleotide sequences, unlike --text.

An alignment by --anysymbol:


-------------2008~KATO~Toh
2005~Katoh~Kuma~MIYATA~Toh
                   **  * *