-- IsabelCampos - 03 Apr 2009
Site configuration for MPI
This document is intended to help EGEE site administrators to properly support
MPI deployments. The recommendations for supporting MPI on EGEE were drafted by
the MPI Technical Coordination Group and can be found at this link.
Installation of MPI
The two MPI flavours currently supported are OpenMPI and MPICH.
OpenMPI
You can build your own package by downloading the source RPM from the Open MPI downloads page.
If you wish to use a special interconnect, such as Infiniband, make sure to enable it at
compile time by passing the --with-openib option to the configure script.
More details about this topic can be found here: configuration of Infiniband Switches for gLite.
If your batch system is Torque, it is recommended to enable the Torque TM subsystem (thus
getting proper accounting numbers) by passing the
--enable-mca-dso=pls-tm
option.
Furthermore, if you wish to use the Intel Compiler, you must define the CC, CXX, FC and F77 variables to point to the correct compilers.
# export F77=ifort
# export FC=ifort
# export CC=icc
# export CXX=icpc
Compile and build your RPM as follows:
# rpmbuild -ba openmpi-1.2.5.spec -D 'name openmpi' -D '_packager name' -D 'configure_options --with-openib --enable-mca-dso=pls-tm'
Once compiled, you can deploy this package to all of your WNs.
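If you prefer a manual build from the source tarball rather than rebuilding the RPM, a minimal sketch follows; the version, tarball name and installation prefix are examples only and must be adapted to your site:
# tar xjf openmpi-1.2.5.tar.bz2
# cd openmpi-1.2.5
# ./configure --prefix=/opt/openmpi/1.2.5 --with-openib --enable-mca-dso=pls-tm
# make all install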
MPICH
TBD
MPI-Start
MPI-Start is the recommended solution for hiding the implementation details from submitted jobs. It was developed within the Int.EU.Grid project and can be downloaded from here.
It should be installed on every node involved with MPI. We recommend installing it at least on the Worker Nodes, on the Computing Element
and on the User Interface (for user testing purposes).
Batch system configuration
Here you can find the instructions to manually configure different batch systems to execute MPI jobs. There is a yaim module that can perform automatic configuration for PBS/Torque schedulers. You can find more information about this module below.
PBS-based schedulers
PBS-based schedulers such as Torque do not deal properly with CPU allocations,
because they assume homogeneous systems with the same number of CPUs on all
the nodes (machines). $cpu_per_node can be set in the jobmanager,
but it has to be the same for all the machines. Furthermore, PBS does not seem
to understand that a process may be running on only one CPU of a two-CPU machine
in a farm, leaving half of the capacity free for more jobs.
For these reasons, some extra configuration of the batch system is needed.
Torque
Edit your Torque configuration file (/var/spool/pbs/torque.cfg), creating it if it does not exist, and add a line containing:
SUBMITFILTER /var/spool/pbs/submit_filter.pl
Then download submit_filter.pl from here and put it in the above location.
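Make sure the filter is executable by the batch server; for instance, with the path used above:
# chmod 755 /var/spool/pbs/submit_filter.pl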
This filter modifies the script coming from the submission, rewriting the -l nodes=XX option with specific requests, based on the information given by the pbsnodes -a command.
The submit filter is crucial: failing to use it results in the job being submitted to a single node, where all the MPI processes are allocated, instead of being distributed across several nodes.
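As a purely illustrative example (the actual result depends on the output of pbsnodes -a at your site), the filter could rewrite a request for four processes on a farm of two-CPU machines as follows:
# Directive in the submitted script before filtering:
#PBS -l nodes=4
# Possible directive after the submit filter has run:
#PBS -l nodes=wn01:ppn=2+wn02:ppn=2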
Warning: gLite updates tend to rewrite torque.cfg. Check that the submit filter line is still present after performing an update.
Maui
Edit your configuration file (usually under
/var/spool/maui/maui.cfg
) and check that it contains the
following lines:
ENABLEMULTINODEJOBS TRUE
ENABLEMULTIREQJOBS TRUE
These parameters allow a job to span more than one node and to specify
multiple independent resource requests.
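A quick check that both settings are present (path as above):
grep -E 'ENABLEMULTINODEJOBS|ENABLEMULTIREQJOBS' /var/spool/maui/maui.cfg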
Sun Grid Engine (SGE)
MPI support for the lcg-CE with SGE is still under development; however, it is possible to configure MPI on an lcg-CE with SGE in production, at your own risk, by following these instructions.
Note: In this example we assume that the SGE batch system is already configured and working correctly. You can find information about SGE configuration and how to set it up with the lcg-CE or CREAM-CE in the
SGE CookBook V2-1.
Install the necessary
I2G MPI RPMs on the CE and on all the necessary WNs:
# i2g-openmpi-1.2.2-6
# i2g-mpi-start-0.0.58-1
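A minimal installation sketch, assuming the corresponding RPM files are available locally (the exact file names depend on your build and architecture):
# rpm -ivh i2g-openmpi-1.2.2-6.*.rpm i2g-mpi-start-0.0.58-1.*.rpm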

The i2g-mpi-start package installs the files mpi_start.csh and mpi_start.sh under /etc/profile.d/. This should be taken into account: in this example these files were deleted, because we load the MPI environment in a different way (as explained in the next steps).
Note: In this example we decided to install the OpenMPI version compiled for the I2G project (i2g-openmpi-1.2.2-6), but it is possible to use a more recent OpenMPI version compiled on your own.
After installing the MPI RPMs, it is necessary to install the latest version of the JobManager, which supports MPI. It can be downloaded from this link: SGE JobManager V0.53.
Install the new
JobManager version:
tar -zxvf sge_job_manager.tar.gz
cd sge_job_manager/JobManager/
cp -fr lcgsge.pm /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm
Restart the services on the CE.

It is recommended to make a backup of the previous JobManager (/opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm) before replacing it.
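A simple way to make that backup before copying the new lcgsge.pm into place (the .orig suffix is just a convention):
cp -p /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm \
      /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm.orig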
Once the OpenMPI, i2g-mpi-start and JobManager packages are installed, it is time to create an SGE parallel environment and configure it on all the necessary queues.
• Add a parallel environment:
qconf -ap parallel_environment_name
One example:
pe_name openmpi_egee
slots 16
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule $round_robin
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary FALSE

The slots variable configures the total number of slots (nodes) that can be used by an MPI job. The start_proc_args and stop_proc_args variables are left at their default value "/bin/true", because the kick-off and stopping of the MPI sub-tasks is done by i2g-mpi-start.
Add the parallel environment created above to the necessary queues, so that there is a queue able to run parallel jobs:
qconf -mq queue_name
- To modify an existing queue, add the parallel environment created above to the pe_list variable.
One example:
[root@ce3 ~]# qconf -sq GRID_ops
qname GRID_ops
[ ... ]
pe_list openmpi_egee
[ ... ]

With the qconf -sq queue_name command you can inspect the configuration of a given queue.

To publish the correct MPI values, changes must currently be made by hand; there is no YAIM function available to do this yet.
Add variables to the environment by appending the following lines at the end of the /etc/profile.d/grid-env.sh file on the CE:
gridenv_set "MPI_OPENMPI_VERSION" "1.2.2" # The OPENMPI version installed on the CE
gridenv_set "I2G_MPI_START" "/opt/i2g/bin/mpi-start" # The PATH where is installed mpi-start script, by default this one
gridenv_set "MPI_OPENMPI_PATH" "/opt/i2g/openmpi/" # The PATH where is installed OPENMPI
gridenv_set "MPI_SSH_HOST_BASED_AUTH" "yes" #
gridenv_set "MPI_SHARED_HOME" "no" # If you have a shared file-system area between WNs, you can publish with:
gridenv_set "MPI_MPICC_OPTS" "-m32"
gridenv_set "MPI_MPICXX_OPTS" "-m32"
gridenv_set "MPI_MPIF77_OPTS" "-m32"

Explanation of these variables:
MPI_SHARED_HOME: set this to "yes" if you have a shared home area between WNs.
MPI_SSH_HOST_BASED_AUTH: if you do NOT have SSH host-based authentication between your WNs, set this to "no".
MPI_<flavour>_MPIEXEC: if you are using OSC mpiexec with MPICH, set this to the location of the mpiexec program, e.g. "/usr/bin/mpiexec".
MPI_MPI_START: location of mpi-start, if not installed in the standard location /opt/i2g/bin/mpi-start.
In this example we are using the tar-WN distribution on the WNs, so we use the file /opt/cesga/lcg-wn/external/etc/profile.d/grid-env.sh, configured in the JobManager (see the SGE CookBook for more information), to load the correct environment for each job. We therefore added the following variables at the end of that file:
gridenv_set "MPI_OPENMPI_VERSION" "1.2.2"
gridenv_set "I2G_MPI_START" "/opt/i2g/bin/mpi-start"
gridenv_set "MPI_OPENMPI_PATH" "/opt/i2g/openmpi/"
gridenv_set "MPI_SSH_HOST_BASED_AUTH" "yes"
gridenv_set "MPI_SHARED_HOME" "no"
gridenv_set "MPI_MPICC_OPTS" "-m32"
gridenv_set "MPI_MPICXX_OPTS" "-m32"
gridenv_set "MPI_MPIF77_OPTS" "-m32"

If you are not using the tar-WN distribution, the process is the same as for the CE; in other words, add these variables at the end of the /etc/profile.d/grid-env.sh file.
Add the following entries to the /opt/glite/etc/gip/ldif/static-file-Cluster.ldif file on the CE in order to publish MPI support in the site-BDII:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.2
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SSH_HOST_BASED_AUTH
You can check what is being published with:
ldapsearch -x -H ldap://ce2.egee.cesga.es:2170 -b mds-vo-name=resource,o=grid | grep -i mpi
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SSH_HOST_BASED_AUTH
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.2

As explained above, if you have a shared home area between WNs you should also add the variable
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME
to publish it in the site-BDII. In this example we do not have a shared home area between WNs.

Testing:
Testing the environment on a WN:
[dteam001@compute-2-8 ~]$ env|grep MPI_
MPI_SSH_HOST_BASED_AUTH=yes
MPI_OPENMPI_PATH=/opt/i2g/openmpi/
MPI_OPENMPI_VERSION=1.2.2
MPI_SHARED_HOME=no
I2G_MPI_START=/opt/i2g/bin/mpi-start
Submitting a job directly to the SGE batch system:
qsub -l num_proc=1 -pe openmpi_egee 2 test.sh
Note: In this example 2 slots are requested, with one processor each.
test.sh content:
#!/bin/bash
MPI_FLAVOR=OPENMPI
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`
# Path where the MPI package is installed
export MPI_PATH=/opt/i2g/openmpi
# Ensure the prefix is correctly set. Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX
export X509_USER_PROXY=/tmp/x509up_u527
export I2G_TMP=/tmp
export I2G_LOCATION=/opt/i2g
#export I2G_OPENMPI_PREFIX=/opt/i2g/openmpi
export I2G_MPI_TYPE=openmpi
export I2G_MPI_FLAVOUR=openmpi
# Path to the application that we want to run
## This application should be copied to the WN; in this example it was copied to the /tmp directory
export I2G_MPI_APPLICATION=/tmp/cpi
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_NP=2
export I2G_MPI_JOB_NUMBER=0
export I2G_MPI_STARTUP_INFO=/home/glite/dteam004
export I2G_MPI_PRECOMMAND=
export I2G_MPI_RELAY=
# Path where the mpi-start RPM is installed
export I2G_MPI_START=/opt/i2g/bin/mpi-start
export I2G_MPI_START_DEBUG=1
export I2G_MPI_START_VERBOSE=1
$I2G_MPI_START
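Once the job has been submitted with qsub, you can follow it with the standard SGE commands; a short usage sketch (the pool account name is a placeholder):
qstat -u dteam001
# stdout/stderr of test.sh end up in the usual SGE output files (test.sh.o<jobid> / test.sh.e<jobid>)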
Submitting the previous job using a WMS:
Building the JDL file:
[esfreire@ui mpi_egee]$ cat lanzar_cpi_mpi.jdl
JobType = "Normal";
VirtualOrganisation = "dteam";
NodeNumber = 2;
# We use a wrapper script to start the MPI job
Executable = "mpi-start-wrapper.sh";
Arguments = "cpi OPENMPI";
StdOutput = "cpi.out";
StdError = "cpi.err";
InputSandbox = {"cpi","mpi-start-wrapper.sh"};
OutputSandbox = {"cpi.out","cpi.err"};
# In this example we require an available queue on one of the CEs in CESGA-EGEE production
Requirements = other.GlueCEUniqueID == "ce2.egee.cesga.es:2119/jobmanager-lcgsge-GRID_dteam";
mpi-start-wrapper.sh content:
[esfreire@ui mpi_egee]$ cat mpi-start-wrapper.sh
#!/bin/bash
# Pull in the arguments.
MY_EXECUTABLE=`pwd`/$1
MPI_FLAVOR=$2
# Convert flavor to lowercase for passing to mpi-start
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`
# Pull out the correct paths for the requested flavor.
eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`
export MPI_PATH=$MPI_PATH
# Ensure the prefix is correctly set. Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX
# Touch the executable. It must exist for the shared file system check.
# If it does not, then mpi-start may try to distribute the executable
# when it shouldn't.
touch $MY_EXECUTABLE
chmod +x $MY_EXECUTABLE
# Setup for mpi-start.
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
# optional hooks
#export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
#export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh
# If these are set then you will get more debugging information.
#export I2G_MPI_START_VERBOSE=1
#export I2G_MPI_START_DEBUG=1
# Invoke mpi-start.
$I2G_MPI_START
Submitting the job:
glite-wms-job-submit -a -o mpi.job lanzar_cpi_mpi.jdl
Downloading the job results:
glite-wms-job-output -i mpi.job --dir ~/jobOutput/
[esfreire@ui mpi_egee]$ cat ../jobOutput/cpi.out
Pre MPI_init
Pre MPI_init
Pre MPI_init
Pre MPI_init
Pre MPI_Comm_size
Pre MPI_Comm_size
Pre MPI_Comm_rank
Pre MPI_Get_processor_name
Process 0 of 4 is on compute-2-1.local
Pre MPI_Comm_size
Pre MPI_Comm_rank
Pre MPI_Get_processor_name
Process 1 of 4 is on compute-2-1.local
Pre MPI_Comm_rank
Pre MPI_Get_processor_name
Process 2 of 4 is on compute-2-0.local
Pre MPI_Comm_size
Pre MPI_Comm_rank
Pre MPI_Get_processor_name
Process 3 of 4 is on compute-2-0.local
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.012632
Note: In both examples we are using a compiled application called "cpi", which computes the number pi.
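If you need to build the test binary yourself, a minimal sketch is shown below; it assumes a cpi.c source file (for instance the one shipped with the MPI distribution examples) and OpenMPI installed under /opt/i2g/openmpi as in the examples above:
export PATH=/opt/i2g/openmpi/bin:$PATH
mpicc -o cpi cpi.c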
Configuration of the Worker nodes
It is necessary to use either a shared storage area (i.e. $HOME or a scratch dir) on all the nodes, or to set up passwordless SSH access (i.e. host-based access) between them.
Each option has its pros and cons, so it is up to you which one to choose, depending on your site hardware and software details.
Environment variables
These environment variables should be set for jobs executing on a worker node at an MPI site. This is normally done by adding a script to /etc/profile.d. The environment variable names should map directly onto the runtime environment tags published in the information system.
All are prefixed with MPI_:
Mandatory:
- MPI_<flavour>_VERSION
- MPI_<flavour>_PATH
Examples:
- MPI_OPENMPI_VERSION=1.2.5
- MPI_OPENMPI_PATH=/opt/i2g/openmpi
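A minimal sketch of such a profile script, e.g. /etc/profile.d/mpi.sh (the file name, paths and versions are examples and must match your installation):
# Example /etc/profile.d/mpi.sh - adjust paths and versions to your site
export MPI_OPENMPI_VERSION=1.2.5
export MPI_OPENMPI_PATH=/opt/i2g/openmpi
export MPI_SHARED_HOME=no
export MPI_SSH_HOST_BASED_AUTH=yes
export I2G_MPI_START=/opt/i2g/bin/mpi-start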
mpiexec
Some sites use mpiexec, as it uses the scheduler interface directly to execute multi-node jobs, so usage is accounted correctly. Sites which have it installed should set the following environment variable to the top directory of their mpiexec installation, e.g.:
MPI_MPICH_MPIEXEC=/opt/mpiexec-0.80
Support for this format was added to i2g-mpi-start as of version 0.0.46.
Optional:
- MPI_<flavour>_COMPILER
- MPI_<flavour>_<version>_PATH
- MPI_<flavour>_<version>_<compiler>_PATH
- MPI_INTERCONNECT=<interconnect>
Configuration of the Information System
Sites may install different implementations (or flavours) of MPI. It is expected that users will converge on OpenMPI in the future, but for the moment a variety of libraries and tools are in use. It is therefore important that users can use the information system to locate sites with the software they require. You should publish some values to let the world know which flavour of MPI you are supporting, as well as the interconnect and some other features. Everything related to MPI should be published as GlueHostApplicationSoftwareRunTimeEnvironment in the corresponding sections.
Publishing MPI-START
If you support MPI-START publish it with:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
Publishing your MPI Flavour
Publish which flavour and version of MPI you are using. Optionally, you can also specify the compiler and its version. MPI flavours are MPICH, MPICH2 and OPENMPI. For example:
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.5
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.5-ICC
Publishing your MPI Interconnect
If you have any special Interconnect (like Infiniband, or Myrinet) you can
publish it like:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-Infiniband
Publish other features related to MPI
If you have a shared filesystem area between WNs, you can publish with:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME
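To verify what your CE actually publishes, you can query its resource BDII directly, as in the SGE example above (replace the host name with your own CE):
ldapsearch -x -H ldap://<your-ce>:2170 -b mds-vo-name=resource,o=grid | grep -i mpi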
Yaim Configuration
A YAIM module for MPI configuration is available as part of the gLite distribution (glite-yaim-mpi). This module performs automatic configuration for PBS/Torque schedulers, enabling the use of submit filters, the definition of environment variables and the configuration of the information system. If you have another batch system or prefer to do the configuration manually, you can find the instructions above.
Configure MPI in site-info.def
Individual "flavours" of MPI are enabled by setting the associated variable to "yes". For example, to enable Open MPI, add the following:
MPI_OPENMPI_ENABLE="yes"
You may set the path and version for a flavour of MPI as follows:
MPI_OPENMPI_PATH="/opt/i2g/openmpi/"
MPI_OPENMPI_VERSION="1.2.5"
The remaining variables are:
| Variable | Meaning |
| MPI_SHARED_HOME | Set this to "yes" if you have a shared home area between WNs. |
| MPI_SSH_HOST_BASED_AUTH | If you do NOT have SSH host-based authentication between your WNs, set this to "no". |
| MPI_<flavour>_MPIEXEC | If you are using OSC mpiexec with MPICH, set this to the location of the mpiexec program, e.g. "/usr/bin/mpiexec". |
| MPI_MPI_START | Location of mpi-start, if not installed in the standard location /opt/i2g/bin/mpi-start. |
Here is an example configuration:
#----------------------------------
# MPI-related configuration:
#----------------------------------
# Several MPI implementations (or "flavours") are available.
# If you do NOT want a flavour to be installed/configured, set its variable
# to "no". Else, set it to "yes" (default). If you want to use an
# already installed version of an implementation, set its "_PATH" and
# "_VERSION" variables to match your setup (examples below).
#
# NOTE 1: the CE_RUNTIMEENV will be automatically updated in the file
# functions/config_mpi, so that the CE advertises the MPI implementations
# you choose here - you do NOT have to change it manually in this file.
# It will become something like this:
#
# CE_RUNTIMEENV="$CE_RUNTIMEENV
# MPI_MPICH
# MPI_MPICH2
# MPI_OPENMPI
# MPI_LAM"
#
# NOTE 2: it is currently NOT possible to configure multiple concurrent
# versions of the same implementations (e.g. MPICH-1.2.3 and MPICH-1.2.7)
# using YAIM. Customize "/opt/glite/yaim/functions/config_mpi" file
# to do so.
MPI_OPENMPI_ENABLE="yes"
MPI_MPICH_ENABLE="yes"
MPI_MPICH2_ENABLE="yes"
MPI_LAM_ENABLE="yes"
#---
# Example for using an already installed version of MPI.
# Setting "_PATH" and "_VERSION" variables will prevent YAIM
# from downloading and installing the gLite-provided packages.
# Just fill in the path to its current installation (e.g. "/usr")
# and which version it is (e.g. "6.5.9").
#---
MPI_OPENMPI_PATH="/opt/i2g/openmpi/"
MPI_OPENMPI_VERSION="1.2.5"
MPI_MPICH_PATH="/opt/mpich-1.2.7p1/"
MPI_MPICH_VERSION="1.2.7p1"
MPI_MPICH2_PATH="/opt/mpich2-1.0.4/"
MPI_MPICH2_VERSION="1.0.4"
MPI_LAM_VERSION="7.1.2"
# If you do NOT provide a shared home, set $MPI_SHARED_HOME to "no" (default).
#
MPI_SHARED_HOME="yes"
#
# If you do NOT have SSH Hostbased Authentication between your WNs, set the below
# variable to "no" (default). Else, set it to "yes".
#
MPI_SSH_HOST_BASED_AUTH="no"
#
# If you provide an 'mpiexec' for MPICH or MPICH2, please state the full path to
# that file here (http://www.osc.edu/~pw/mpiexec/index.php). Else, leave empty.
#
#MPI_MPICH_MPIEXEC="/usr/bin/mpiexec"
MPI_MPICH_MPIEXEC="/usr/bin/mpiexec"
Configure CE
/opt/glite/yaim/bin/yaim -c -s site-info.def -n MPI_CE
Configure WN
For a Torque worker node:
/opt/glite/yaim/bin/yaim -c -s site-info.def -n MPI_WN -n glite-WN -n TORQUE_client
Testing configuration
You can do some basic tests by logging in on a WN as a pool user and running the following:
[dte056@cagnode48 dte056]$ env|grep MPI_
You should see something like this:
MPI_MPICC_OPTS=-m32
MPI_SSH_HOST_BASED_AUTH=yes
MPI_OPENMPI_PATH=/opt/openmpi/1.1
MPI_LAM_VERSION=7.1.2
MPI_MPICXX_OPTS=-m32
MPI_LAM_PATH=/usr
MPI_OPENMPI_VERSION=1.1
MPI_MPIF77_OPTS=-m32
MPI_MPICH_VERSION=1.2.7
MPI_MPIEXEC_PATH=/opt/mpiexec-0.80
MPI_MPICH2_PATH=/opt/mpich2-1.0.4
MPI_MPICH2_VERSION=1.0.4
I2G_MPI_START=/opt/i2g/bin/mpi-start
MPI_MPICH_PATH=/opt/mpich-1.2.7p1
You can also try submitting a job to your site using the instructions found below.
Submission of MPI Jobs
In order to invoke MPI-START you need a wrapper script that sets the environment variables that define your job. This script is generic and should not need to have significant modifications made to it.
#!/bin/bash
# Pull in the arguments.
MY_EXECUTABLE=`pwd`/$1
MPI_FLAVOR=$2
# Convert flavor to lowercase for passing to mpi-start.
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`
# Pull out the correct paths for the requested flavor.
eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`
# Ensure the prefix is correctly set. Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX
# Touch the executable. It must exist for the shared file system check.
# If it does not, then mpi-start may try to distribute the executable
# when it shouldn't.
touch $MY_EXECUTABLE
chmod +x $MY_EXECUTABLE
# Setup for mpi-start.
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
# optional hooks
#export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
#export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh
# If these are set then you will get more debugging information.
#export I2G_MPI_START_VERBOSE=1
#export I2G_MPI_START_DEBUG=1
# Invoke mpi-start.
$I2G_MPI_START
In your JDL file you should set the JobType to "Normal" and set NodeNumber to the number of desired nodes. The Executable should be your wrapper script for MPI-START (mpi-start-wrapper.sh in this case) and the Arguments are your MPI binary and the MPI flavour that it uses. MPI-START allows user-defined extensions via hooks; check the MPI-START Hook CookBook for examples. Here is an example JDL for the submission of the cpi application using 10 processes:
JobType = "Normal";
VirtualOrganisation = "dteam";
NodeNumber = 10;
Executable = "mpi-start-wrapper.sh";
Arguments = "cpi OPENMPI";
StdOutput = "cpi.out";
StdError = "cpi.err";
InputSandbox = {"cpi", "mpi-start-wrapper.sh"};
OutputSandbox = {"cpi.out", "cpi.err"};
Requirements = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& Member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& Member("OPENMPI-1.2.5", other.GlueHostApplicationSoftwareRunTimeEnvironment);
Please note that the NodeNumber variable refers to the number of CPUs you are requesting. The new EGEE MPI WG is discussing how to implement a fine-grained selection of the nodes and/or CPUs (i.e. to specify the number of processors per node and the number of nodes, not only the total number of CPUs).
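As a usage sketch, the example JDL can be submitted and retrieved with the standard WMS commands already shown in the SGE section; the JDL file name (here cpi-openmpi.jdl) and the job-ID file mpi.job are arbitrary:
glite-wms-job-submit -a -o mpi.job cpi-openmpi.jdl
glite-wms-job-status -i mpi.job
glite-wms-job-output -i mpi.job --dir ~/jobOutput/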
Sites without MPI-START
If you want to get the convenience of mpi-start usage even at sites which have not yet installed it, you can submit a tarball (e.g. mpi-start-0.0.58.tar.gz) of mpi-start along with your job (in the input sandbox) and add the following lines at the start of your wrapper script to set it up:
if [ "x$I2G_MPI_START" = "x" ]; then
# untar mpi-start and set up variables
tar xzf mpi-start-*.tar.gz
export I2G_MPI_START=bin/mpi-start
MPIRUN=`which mpirun`
export MPI_MPICH_PATH=`dirname $MPIRUN`
fi
Known issues and ongoing work
MPI job support is a necessity for many application areas; however, the configuration is highly dependent on the
cluster architecture. The design of mpi-start focused on making MPI job submission as independent
of the cluster details as possible.
However, there are still issues to be addressed; please feel free to send an e-mail to complete this list:
* Selection of the proper core/CPU combination
Fine-grained selection of the nodes and/or cores per CPU. This affects the efficiency of the code, as the
scaling properties of MPI codes depend heavily on it.
* Identification of the interconnect technology in the Information System
There is a need for an unambiguous name to identify, via the information system, the available cluster interconnects to be used in
GlueHostApplicationSoftwareRunTimeEnvironment.
* Resource Reservation
Reservation of resources for MPI jobs. MPI jobs should not share a node with a serial job.
For MPI applications that make intensive use of the interconnect, sharing the node with
a serial job or with another MPI job is not an option.