MPI-parallelization of MAFFT

MPI version of high-accuracy progressive options, [GLE]-large-INS-1

Download, Compile, Install

Source package only
Versions ≥7.350 support MPI. Linux only.

Environmental variables

Two environmental variables, MAFFT_N_THREADS_PER_PROCESS and MAFFT_MPIRUN, have to be set.

An example that uses 160 cores (16 cores × 10 hosts):

The number of threads to run in a process:
$ export MAFFT_N_THREADS_PER_PROCESS="1"
Set this to "1" unless you use the MPI/Pthreads hybrid mode.

Location of mpirun/mpiexec and options:
OpenMPI:
$ export MAFFT_MPIRUN="/somewhere/bin/mpirun -n 160 -npernode 16 -bind-to none ..."
MPICH:
$ export MAFFT_MPIRUN="/somewhere/bin/mpirun -n 160 -perhost 16 -binding none ..."
The mpirun or mpiexec must come from the same MPI library as the mpicc that was used for compiling.
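
The arithmetic behind the 160-core example can be scripted so that the process count always matches the host layout. The following is a minimal sketch for OpenMPI, not part of the MAFFT distribution. The helper variable names (HOSTS, CORES_PER_HOST, MPIRUN_BIN, NPROCS) are assumptions for illustration, /somewhere/bin/mpirun is the placeholder path used above, and any further site-specific options (the "..." above) are left out.

```shell
# Hypothetical helper variables; adjust them to your cluster.
HOSTS=10                              # number of hosts
CORES_PER_HOST=16                     # cores used on each host
MPIRUN_BIN="/somewhere/bin/mpirun"    # placeholder path, as in the text above

# Total number of MPI processes = hosts x cores per host.
NPROCS=$((HOSTS * CORES_PER_HOST))

# One thread per process (MPI only, no MPI/Pthreads hybrid mode).
export MAFFT_N_THREADS_PER_PROCESS="1"

# OpenMPI-style options, copied from the example above.
export MAFFT_MPIRUN="$MPIRUN_BIN -n $NPROCS -npernode $CORES_PER_HOST -bind-to none"

echo "$MAFFT_MPIRUN"
```

Changing HOSTS or CORES_PER_HOST then updates both the -n and -npernode values consistently.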

Depending on the configuration of your cluster, it may be necessary to set LD_LIBRARY_PATH.
$ export LD_LIBRARY_PATH="/somewhere/lib"
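
If LD_LIBRARY_PATH already has a value in your environment, the plain assignment above overwrites it. A common shell idiom, sketched here under the assumption that the existing entries should be kept, prepends the MPI library directory instead. MPI_LIB is a hypothetical helper variable, and /somewhere/lib is the same placeholder as above.

```shell
# Placeholder directory of the MPI libraries, as in the text above.
MPI_LIB="/somewhere/lib"

# Prepend MPI_LIB; the ":old value" part is added only if
# LD_LIBRARY_PATH was already set and non-empty.
export LD_LIBRARY_PATH="$MPI_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

echo "$LD_LIBRARY_PATH"
```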

(Optional) Location of temporary directory (see below):
$ export MAFFT_TMPDIR="/location/of/shared/filesystem/"

To avoid typing these commands each time, try the batch scripts (see Batch below), in which the parameters are easily set.

Command

Add "--mpi --large" to the normal command of G-INS-1, L-INS-1 or E-INS-1.
G-large-INS-1:
$ mafft --mpi --large --globalpair --thread 16 input 

L-large-INS-1:
$ mafft --mpi --large --localpair --thread 16 input 

E-large-INS-1:
$ mafft --mpi --large --genafpair --thread 16 input 

E-large-INS-1 (old parameters):
$ mafft --mpi --large --oldgenafpair --thread 16 input 
The --thread flag specifies the maximum number of threads (16 in these examples) used in step 2 (see Tips below) and in other calculations performed on a single node. It must be less than or equal to the number of physical cores in a single host. (The flag was changed from --threadtb to --thread on 2018/Sep/22.)
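
The limit on --thread can be checked before the job is submitted. The sketch below is an assumption-laden illustration, not part of MAFFT: it caps a requested thread count at the number of online CPUs reported by getconf and only prints the resulting command. Note that getconf counts logical CPUs, which can exceed the number of physical cores when hyper-threading is enabled, so the cap is an upper bound only. REQUESTED, AVAILABLE, THREADS and CMD are hypothetical names.

```shell
REQUESTED=16    # desired value for --thread

# Logical CPUs online on this host; fall back to 1 if getconf is unavailable.
AVAILABLE=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Use the smaller of the two values.
if [ "$REQUESTED" -gt "$AVAILABLE" ]; then
    THREADS=$AVAILABLE
else
    THREADS=$REQUESTED
fi

# Print the command instead of running it, so that it can be inspected first.
CMD="mafft --mpi --large --globalpair --thread $THREADS input"
echo "$CMD"
```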

Batch (Not yet fully tested)

To set the environmental variables and run the command in one go, a batch script can be used.
$ sh mpionly.noscheduler
Edit the mpionly.noscheduler script according to your cluster's environment, and run it. The variables are explained in detail in the batch file itself.
For job schedulers, edit and run one of the following templates. If unsuccessful, run the scheduler-free script above on a small input to identify the cause.
LSF:
$ bsub < mpionly.lsf
Many clusters with the SGE/UGE scheduler have a parallel environment (PE) that allocates a specific number of cores per host. An example with a PE named mpi16:
$ qsub mpionly.uge
Ask your system administrator whether an appropriate PE is available.
Otherwise, disable multithreading:
$ qsub singlethread.uge
In this case, step 2 runs in serial.
PBS:
$ qsub mpionly.pbs
SLURM:
In preparation.

Tips

Can be applied to >10,000 sequences. The upper limit depends on disk space (?). Not efficient for small data.
More accurate than high-speed options, such as FFT-NS-2.
The calculation of [GLE]-large-INS-1 consists of two steps that are parallelized differently:

Step 1: All-to-all alignment step → MPI (+ pthread, optionally)
Step 2: Progressive alignment step → pthread

Accordingly, parameters (the number of threads, etc) for these two steps can be independently configured:

All-to-all alignment step → MAFFT_N_THREADS_PER_PROCESS and MAFFT_MPIRUN, explained above
Progressive alignment step → --thread option

The CPU time for step 1 is much larger than that for step 2 when the number of sequences is large. In this implementation, the parallelization efficiency is generally high in step 1 but low in step 2. Thus it is not necessary to give a very large value to --thread.
Wall-clock times in three cases are shown in the figure below. Panels a, c and e → step 1; Panels b, d and f → step 2.
The location of the temporary directory can be specified with the MAFFT_TMPDIR environmental variable. If it is not set, $HOME/maffttmp/ is automatically created and used as the temporary directory. The temporary directory must be shared by all hosts. If your system has a high-speed shared filesystem, such as Lustre, use it for the temporary directory.
```
# tcsh, csh, etc.
% setenv MAFFT_TMPDIR /location/of/shared/filesystem/
# bash, zsh, etc.
$ export MAFFT_TMPDIR=/location/of/shared/filesystem/
```
In the cases shown in the figure, the efficiency with a Lustre temporary directory is higher than that with NFS, especially in panel c (short sequences).
The size of the data stored in the temporary directory is proportional to N², where N is the number of sequences. It also depends on the sequence similarity. In one case with N∼80,000, the data size was ∼150 GB. Be careful about disk space and quota.
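
The N² scaling allows a rough extrapolation from the single data point above (about 150 GB at N of about 80,000). The function below is only a sketch under that assumption. The name estimate_tmp_gb is hypothetical, the result is truncated to whole gigabytes, and, because the size also depends on similarity, real usage may differ considerably.

```shell
# Rough estimate of temporary data size in GB, scaled as N^2 from the
# single reference point in the text: about 150 GB at N = 80,000.
estimate_tmp_gb() {
    n=$1
    echo $(( 150 * n * n / (80000 * 80000) ))
}

# Twice as many sequences need about four times the space.
estimate_tmp_gb 160000
```

Comparing such an estimate with the free space and quota of the shared filesystem before starting a large run may help avoid a failure midway.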
The size of temporary data can be reduced by the --limitlh n option. Its effect on the accuracy is being tested.
The temporary data is removed when the calculation ends normally or is terminated by Ctrl-C or the kill command. However, when the job is terminated by a job scheduler, the data may remain unless a regular cleanup process removes it.
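
One way to limit leftovers after a scheduler-initiated termination is to give each job its own temporary subdirectory and remove it from the job script. The following is a sketch, not a feature of MAFFT. SHARED_BASE, JOB_TMP and cleanup are hypothetical names, SHARED_BASE should point to a directory on the shared filesystem (the fallback to /tmp is only there so that the sketch runs anywhere), and the mafft command is left commented out. A job that is killed with SIGKILL cannot run the trap, so a regular cleanup process is still advisable.

```shell
# Hypothetical base directory; set SHARED_BASE to a shared filesystem path.
BASE="${SHARED_BASE:-${TMPDIR:-/tmp}}"

# Create a job-specific directory and point MAFFT at it.
JOB_TMP=$(mktemp -d "$BASE/mafft-job.XXXXXX")
export MAFFT_TMPDIR="$JOB_TMP"

# Remove the job-specific directory when the script exits.
cleanup() {
    rm -rf "$JOB_TMP"
}
trap cleanup EXIT

# Turn catchable termination signals into a normal exit so that the
# EXIT trap above runs. SIGKILL cannot be caught.
trap 'exit 1' TERM INT

# mafft --mpi --large --globalpair --thread 16 input > output
```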

Problem reports

are welcome.