This feature is supported in versions ≥7.120
Non-biological sequences, or texts consisting of printable characters, can be aligned in the --text mode.
> text 1
> text 2
Most printable characters, except for =*()<>, are accepted in the input sequences.
Non-printable characters will become acceptable in the future.
Spaces and tabs in the input sequences are removed.
They have to be converted to another symbol (~ in the above example) if necessary.
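Because spaces and tabs are removed, word boundaries are lost unless they are converted first. A minimal sketch of that conversion, assuming an input whose header lines start with > and that ~ does not otherwise occur in the text; the function name protect_blanks is hypothetical and not part of MAFFT:

```shell
# Hypothetical helper (not part of MAFFT): replace every space or tab
# in sequence lines with "~" so that it is kept in the alignment.
# Header lines (starting with ">") are passed through unchanged.
protect_blanks() {
    sed '/^>/!s/[[:blank:]]/~/g'
}

# Small demonstration on inline data.
printf '> text 1\nto be or not to be\n' | protect_blanks
```

The converted text can then be given to mafft --text in place of the original file.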
The present version supports the simplest scoring matrix only:
Match → a positive value
Mismatch → a negative value
A user-defined scoring matrix for this mode will become acceptable in the future.
All the acceptable characters are used to compute the alignment score.
The simplest command:
% mafft --text input > output
Other options are also available in this mode.
% mafft-ginsi --text input > output
% mafft --text --clustalout input > output
Output of --text:
*** ** ** ****
Extended alphabet (in alpha testing)
Versions 7.270 and higher accept extended characters such as ö and ä.
Text has to be in an 8-bit encoding, such as LATIN1, Windows-1252 or Mac OS Roman.
UTF-8 text can be converted to and from LATIN1 with the iconv program on Linux, if the text uses Western European alphabets only.
% iconv -f UTF-8 -t LATIN1 input.utf8 > input.latin1
% mafft --text input.latin1 > output.latin1
% iconv -f LATIN1 -t UTF-8 output.latin1 > output.utf8
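Whether a text uses only characters available in LATIN1 can be checked before alignment, since iconv exits with a non-zero status when it meets a character it cannot convert. A minimal sketch, assuming the glibc iconv found on typical Linux systems (other iconv implementations may behave differently); the function name fits_latin1 is hypothetical:

```shell
# Hypothetical helper (not part of MAFFT): succeed only if every
# character of the UTF-8 file "$1" can be represented in LATIN1.
# The converted output is discarded; only the exit status is used.
fits_latin1() {
    iconv -f UTF-8 -t LATIN1 "$1" > /dev/null 2>&1
}

# Usage, with the file name from the commands above:
#   fits_latin1 input.utf8 && echo "can be converted to LATIN1"
```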
The acceptable characters in these versions are 0x00-0xFF, excluding > (0x3E), = (0x3D), < (0x3C), - (0x2D), Space (0x20), Carriage Return (0x0D), Line Feed (0x0A) and NULL (0x00).
With these 8 of the 256 values excluded, the maximum alphabet size is 248.
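The figure of 248 follows from the list of excluded bytes. A short worked count, using only the values stated above (the variable names are illustrative):

```shell
# Bytes excluded from the extended alphabet, as listed above:
# ">", "=", "<", "-", Space, Carriage Return, Line Feed, NULL.
excluded="0x3E 0x3D 0x3C 0x2D 0x20 0x0D 0x0A 0x00"

# Count the entries in the list.
n_excluded=0
for code in $excluded; do
    n_excluded=$((n_excluded + 1))
done

# The range 0x00-0xFF covers 256 byte values in total.
alphabet_size=$((256 - n_excluded))
echo "$alphabet_size"
```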
If non-text data is mapped to this range of characters, the data can be aligned by mafft --text.
(Not tested yet.)
Difference between --text and --anysymbol
The --anysymbol option also accepts input data with non-alphabetical symbols.
This option is for amino acid or nucleotide sequences that contain unusual symbols, such as U
Unlike with --text, input sequences are interpreted as amino acid or nucleotide sequences.
An alignment by --anysymbol:
** * *
Non-amino acid characters are treated as unknown.
In this example, 2, 0, 5, 8, ~, u, O and o are not considered in the alignment calculation.
Thus 200 at the beginning of the sequences is not aligned.
Upper- and lowercase letters (A-a and T-t) are not distinguished.