-- IsabelCampos - 03 Apr 2009

Site configuration for MPI

This document is intended to help EGEE site administrators properly support MPI deployments. The recommendations for supporting MPI on EGEE were drafted by the Technical Coordination Group on MPI and can be found at this link.

Installation of MPI

The two MPI flavours currently supported are OpenMPI and MPICH.

OpenMPI

You can build your own package by downloading the source RPM from the Open MPI downloads page.

If you wish to use a special interconnect, such as Infiniband, be sure to enable it when compiling by passing the --with-openib option to the configure script. More details about this topic can be found here: configuration of Infiniband Switches for glite

If your batch system is Torque, it is recommended to enable the Torque TM subsystem (thus getting proper accounting numbers) by passing the --enable-mca-dso=pls-tm option.

Furthermore, if you wish to use the Intel Compiler, you must define the CC, CXX, FC and F77 variables to point to the correct compilers.

 # export F77=ifort
 # export FC=ifort
 # export CC=icc
 # export CXX=icpc

Compile and build your RPM as follows:

# rpmbuild -ba openmpi-1.2.5.spec -D 'name openmpi' -D '_packager name' -D 'configure_options --with-openib --enable-mca-dso=pls-tm'

Once compiled, you can deploy this package to all of your WNs.
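For example, a minimal deployment sketch over SSH (the worker node names and the RPM file name below are assumptions; adapt them to your site, or use your fabric management tool instead):

 #!/bin/bash
 # Hypothetical list of worker nodes and the RPM produced by the rpmbuild step above.
 WNS="wn01 wn02 wn03"
 RPM=openmpi-1.2.5-1.i386.rpm
 for wn in $WNS; do
     scp "$RPM" root@"$wn":/tmp/ && ssh root@"$wn" "rpm -Uvh /tmp/$RPM"
 done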

MPICH

TBD

MPI-Start

MPI-Start is the recommended solution for hiding the implementation details from submitted jobs. It was developed within the Int.EU.Grid project and can be downloaded from here. It should be installed on every node involved in running MPI. We recommend installing it at least on the Worker Nodes, on the Computing Element and on the User Interface (for user testing purposes).
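A quick sanity check on each node type (WN, CE, UI), assuming the i2g packaging used in the SGE example below:

 # Confirm the package is installed and the launcher is in the default location.
 rpm -q i2g-mpi-start
 ls -l /opt/i2g/bin/mpi-start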

Batch system configuration

Here you can find the instructions to manually configure different batch systems to execute MPI jobs. There is a YAIM module that can perform the configuration automatically for PBS/Torque schedulers; more information about this module can be found below.

PBS-based schedulers

PBS-based schedulers such as Torque do not handle CPU allocation properly, because they assume homogeneous systems with the same number of CPUs on all nodes (machines). $cpu_per_node can be set in the jobmanager, but it has to be the same for all machines. Furthermore, PBS does not seem to understand that processes may be running on only 1 CPU of each 2-CPU machine in a farm, leaving half the capacity free for more jobs. For these reasons, some special configuration must be added to the batch system.

Torque

Edit your Torque configuration file (/var/spool/pbs/torque.cfg), creating it if it does not exist, and add a line containing:

SUBMITFILTER /var/spool/pbs/submit_filter.pl

Then download submit_filter.pl from here and put it in the location given above.

This filter modifies the script coming from the submission, rewriting the -l nodes=XX option with specific requests based on the information given by the pbsnodes -a command.

The submit filter is crucial. Without it, the job is submitted to only one node, where all the MPI processes are allocated, instead of being distributed across several nodes.
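As a purely illustrative sketch (the node names and CPU counts below are assumptions; the real filter derives them from pbsnodes -a), the rewriting looks roughly like this:

 # Directive generated by the jobmanager before filtering:
 #   #PBS -l nodes=4
 # After the submit filter, for a farm of 2-CPU worker nodes:
 #   #PBS -l nodes=wn01:ppn=2+wn02:ppn=2
 # Inspect the node/CPU layout the filter works from:
 pbsnodes -a | grep -E '^[^ ]|np = '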

Warning: gLite updates tend to rewrite torque.cfg. Check that the submit filter line is still there after performing an update.

Maui

Edit your configuration file (usually under /var/spool/maui/maui.cfg) and check that it contains the following lines:

ENABLEMULTINODEJOBS TRUE

ENABLEMULTIREQJOBS TRUE

These parameters allow a job to span more than one node and to specify multiple independent resource requests.
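A quick way to verify the settings and make the scheduler pick them up, assuming Maui is run from its usual init script on the CE:

 grep -E '^(ENABLEMULTINODEJOBS|ENABLEMULTIREQJOBS)' /var/spool/maui/maui.cfg
 /etc/init.d/maui restart   # restart the scheduler so the new parameters take effect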

Sun Grid Engine (SGE)

MPI support for the lcg-CE with SGE is still under development; however, it is possible to configure MPI on an lcg-CE supporting SGE in production, at your own risk, by following these instructions.

Note: In this example we assume that the SGE batch system is already configured and working correctly. You can find information about SGE configuration, and about how to configure it with an lcg-CE or CREAM-CE, in the SGE CookBook V2-1.

Install the necessary I2G MPI RPMs on the CE and on all the necessary WNs:

 # i2g-openmpi-1.2.2-6
 # i2g-mpi-start-0.0.58-1

Warning: the i2g-mpi-start package installs the files mpi_start.csh and mpi_start.sh in /etc/profile.d/. This should be taken into account. In this example, these files were deleted because we load the MPI environment in another way (explained in the next steps).

Note: In this example, we decided to install the OpenMPI version compiled for the I2G project, i2g-openmpi-1.2.2-6, but it is possible to use a more recent OpenMPI version compiled on your own.

After installing the MPI RPMs, it is necessary to install the latest version of the JobManager, which supports MPI. It can be downloaded from this link: SGE JobManager V0.53.

Install the new JobManager version:

tar -zxvf sge_job_manager.tar.gz
cd sge_job_manager/JobManager/

cp -fr lcgsge.pm /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm

Restart the services on the CE.

Warning: it is recommended to back up the previous JobManager, /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm, before overwriting it.
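For example, a simple timestamped copy made before installing the new file:

 cp -p /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm \
       /opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm.$(date +%Y%m%d)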

Once the OpenMPI, i2g-mpi-start and JobManager packages are installed, it is time to create an SGE parallel environment and configure it on all the necessary queues.

• Add a parallel environment:

qconf -ap parallel_environment_name

One example:

pe_name            openmpi_egee
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

Tip: the slots variable configures the total number of slots that can be used by MPI jobs. The start_proc_args and stop_proc_args variables are left with the default value "/bin/true", because the kick-off and stop of the MPI sub-tasks is handled by i2g-mpi-start.

Configure the parallel environment created above on the necessary queues, so that there is a queue able to run parallel jobs:

qconf -mq queue_name

- To modify an existing queue, add the parallel environment created above to the pe_list variable.

One example:

 
[root@ce3 ~]# qconf -sq GRID_ops
qname                 GRID_ops  
[ ... ]
pe_list               openmpi_egee
[ ... ] 

Tip: with the command qconf -sq queue_name, you can see the configuration of a given queue.

Warning: to publish the correct MPI values, changes must currently be made by hand; there is no YAIM function available to do this yet.

Add variables to the environment:

Add the following lines at the end of the /etc/profile.d/grid-env.sh file on the CE:

gridenv_set         "MPI_OPENMPI_VERSION" "1.2.2"              # The OpenMPI version installed on the CE
gridenv_set         "I2G_MPI_START" "/opt/i2g/bin/mpi-start"   # Path to the mpi-start script (this is the default location)
gridenv_set         "MPI_OPENMPI_PATH" "/opt/i2g/openmpi/"     # Path where OpenMPI is installed
gridenv_set         "MPI_SSH_HOST_BASED_AUTH" "yes"            # Set to "no" if there is no SSH host-based authentication between WNs
gridenv_set         "MPI_SHARED_HOME" "no"                     # Set to "yes" if there is a shared home area between WNs
gridenv_set         "MPI_MPICC_OPTS" "-m32"
gridenv_set         "MPI_MPICXX_OPTS" "-m32"
gridenv_set         "MPI_MPIF77_OPTS" "-m32"

Explanation of these variables:

MPI_SHARED_HOME            Set this to "yes" if you have a shared home area between WNs.
MPI_SSH_HOST_BASED_AUTH    If you do NOT have SSH host-based authentication between your WNs, set this to "no".
MPI_MPICH_MPIEXEC          If you are using OSC mpiexec with MPICH, set this to the location of the mpiexec program, e.g. "/usr/bin/mpiexec".
MPI_MPI_START              Location of mpi-start if not installed in the standard location /opt/i2g/bin/mpi-start.

In this example, we are using the tar-WN distribution on the WNs, so we use the file /opt/cesga/lcg-wn/external/etc/profile.d/grid-env.sh, configured in the JobManager (see the SGE CookBook for more information), to load the correct environment for each job. We therefore added the following variables at the end of /opt/cesga/lcg-wn/external/etc/profile.d/grid-env.sh:

gridenv_set         "MPI_OPENMPI_VERSION" "1.2.2"
gridenv_set         "I2G_MPI_START" "/opt/i2g/bin/mpi-start"
gridenv_set         "MPI_OPENMPI_PATH" "/opt/i2g/openmpi/"
gridenv_set         "MPI_SSH_HOST_BASED_AUTH" "yes"
gridenv_set         "MPI_SHARED_HOME" "no"
gridenv_set         "MPI_MPICC_OPTS" "-m32"
gridenv_set         "MPI_MPICXX_OPTS" "-m32"
gridenv_set         "MPI_MPIF77_OPTS" "-m32"

Tip: if you are not using the tar-WN distribution, the process is the same as for the CE; in other words, add these variables at the end of the /etc/profile.d/grid-env.sh file.

Add the variables that publish MPI support in the site BDII by modifying the file /opt/glite/etc/gip/ldif/static-file-Cluster.ldif on the CE:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.2
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SSH_HOST_BASED_AUTH

You can check the published values with an ldapsearch query:

ldapsearch -x -H ldap://ce2.egee.cesga.es:2170 -b mds-vo-name=resource,o=grid | grep -i mpi
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SSH_HOST_BASED_AUTH
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.2

Note: as explained above, if you have a shared home area between WNs you should also add GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME so that it is published in the site BDII. In this example, we do not have a shared home area between WNs.

Tip: Testing

Testing the environment:

[dteam001@compute-2-8 ~]$ env|grep MPI_
MPI_SSH_HOST_BASED_AUTH=yes
MPI_OPENMPI_PATH=/opt/i2g/openmpi/
MPI_OPENMPI_VERSION=1.2.2
MPI_SHARED_HOME=no
I2G_MPI_START=/opt/i2g/bin/mpi-start

Submitting a job directly to the SGE batch system:

qsub -l num_proc=1 -pe openmpi_egee 2 test.sh

Note: In this example, two slots are requested, each with one processor.

test.sh content:

#!/bin/bash


MPI_FLAVOR=OPENMPI
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`

# Path where the MPI package is installed
export MPI_PATH=/opt/i2g/openmpi

# Ensure the prefix is correctly set.  Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX


export X509_USER_PROXY=/tmp/x509up_u527

export I2G_TMP=/tmp
export I2G_LOCATION=/opt/i2g
#export I2G_OPENMPI_PREFIX=/opt/i2g/openmpi
export I2G_MPI_TYPE=openmpi
export I2G_MPI_FLAVOUR=openmpi

# Path to the application that we want to run.
# The application must already be present on the WN; in this example it was copied to the /tmp directory.
export I2G_MPI_APPLICATION=/tmp/cpi 

export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_NP=2
export I2G_MPI_JOB_NUMBER=0
export I2G_MPI_STARTUP_INFO=/home/glite/dteam004
export I2G_MPI_PRECOMMAND=
export I2G_MPI_RELAY=

# Path where the mpi-start RPM is installed
export I2G_MPI_START=/opt/i2g/bin/mpi-start

export I2G_MPI_START_DEBUG=1
export I2G_MPI_START_VERBOSE=1
$I2G_MPI_START

Submitting the previous job using a WMS:

Building the JDL file:

[esfreire@ui mpi_egee]$ cat lanzar_cpi_mpi.jdl
JobType = "Normal";
VirtualOrganisation = "dteam";
NodeNumber = 2;
# We use a wrapper script to start the MPI job
Executable = "mpi-start-wrapper.sh";
Arguments = "cpi OPENMPI";
StdOutput = "cpi.out";
StdError = "cpi.err";
InputSandbox = {"cpi","mpi-start-wrapper.sh"};
OutputSandbox = {"cpi.out","cpi.err"};
# In this example, we require an available queue on one of the CEs in CESGA-EGEE production
Requirements    = other.GlueCEUniqueID == "ce2.egee.cesga.es:2119/jobmanager-lcgsge-GRID_dteam";

mpi-start-wrapper.sh content:

[esfreire@ui mpi_egee]$ cat mpi-start-wrapper.sh
#!/bin/bash
# Pull in the arguments.

MY_EXECUTABLE=`pwd`/$1

MPI_FLAVOR=$2


# Convert flavor to lowercase for passing to mpi-start
MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`

# Pull out the correct paths for the requested flavor.
eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`
export MPI_PATH=$MPI_PATH

# Ensure the prefix is correctly set.  Don't rely on the defaults.
eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
export I2G_${MPI_FLAVOR}_PREFIX

# Touch the executable.  It must exist for the shared file system check.
# If it does not, then mpi-start may try to distribute the executable
# when it shouldn't.
touch $MY_EXECUTABLE
chmod +x $MY_EXECUTABLE

# Setup for mpi-start.
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
# optional hooks
#export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
#export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh

# If these are set then you will get more debugging information.
#export I2G_MPI_START_VERBOSE=1
#export I2G_MPI_START_DEBUG=1

# Invoke mpi-start.
$I2G_MPI_START

Submitting the job: glite-wms-job-submit -a -o mpi.job lanzar_cpi_mpi.jdl

Downloading the job results: glite-wms-job-output -i mpi.job --dir ~/jobOutput/

[esfreire@ui mpi_egee]$ cat ../jobOutput/cpi.out 
Pre MPI_init                                                                    
Pre MPI_init                                                                    
Pre MPI_init                                                                    
Pre MPI_init                                                                    
Pre MPI_Comm_size                                                               
Pre MPI_Comm_size                                                               
Pre MPI_Comm_rank                                                               
Pre MPI_Get_processor_name                                                      
Process 0 of 4 is on compute-2-1.local                                          
Pre MPI_Comm_size                                                               
Pre MPI_Comm_rank                                                               
Pre MPI_Get_processor_name                                                      
Process 1 of 4 is on compute-2-1.local                                          
Pre MPI_Comm_rank                                                               
Pre MPI_Get_processor_name                                                      
Process 2 of 4 is on compute-2-0.local                                          
Pre MPI_Comm_size                                                               
Pre MPI_Comm_rank                                                               
Pre MPI_Get_processor_name                                                      
Process 3 of 4 is on compute-2-0.local                                          
pi is approximately 3.1415926544231239, Error is 0.0000000008333307             
wall clock time = 0.012632  

Note: In both examples, we are using a compiled application called "cpi" that computes the number pi.

Configuration of the Worker nodes

It is necessary to either use a shared storage area (e.g. $HOME or a scratch dir) on all the nodes *or set up passwordless SSH access* (i.e. host-based authentication) between them.

Each option has its pros and cons, so it is up to you to choose, depending on your site's hardware and software details.
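A minimal check for either option, run as a pool user on one WN against a hypothetical second WN called wn02 (the second check also relies on password-less SSH to reach the other node):

 # Host-based (password-less) SSH: this must succeed without any prompt.
 ssh -o BatchMode=yes wn02 hostname

 # Shared home: a file created here must be visible from the other WN.
 touch "$HOME/shared_home_test_$$"
 ssh -o BatchMode=yes wn02 ls "$HOME/shared_home_test_$$"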

Environment variables

These environment variables should be set for jobs executing on a worker node at an MPI site. This is normally done by adding a script to /etc/profile.d (a sketch is given after the examples below). The variable names should correspond directly to the values published in the information system.

All prefixed with MPI_:

Mandatory

  • MPI_<flavour>_VERSION
  • MPI_<flavour>_PATH

Examples:

  • MPI_OPENMPI_VERSION=1.2.5
  • MPI_OPENMPI_PATH=/opt/i2g/openmpi
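A minimal /etc/profile.d sketch (the file name and the OpenMPI location below are assumptions; on PBS/Torque sites the YAIM module described below generates an equivalent file for you):

 # /etc/profile.d/mpi.sh -- hypothetical example for a site providing OpenMPI 1.2.5
 export MPI_OPENMPI_VERSION=1.2.5
 export MPI_OPENMPI_PATH=/opt/i2g/openmpi
 export I2G_MPI_START=/opt/i2g/bin/mpi-start
 export MPI_SSH_HOST_BASED_AUTH=yes   # or MPI_SHARED_HOME=yes, depending on your setup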

mpiexec

Some sites use mpiexec as it uses the scheduler interface directly to execute multi-node jobs, and so usage is accounted correctly. Sites which have this installed should set the following environment variable to the top directory in their mpiexec installation: e.g.

MPI_MPICH_MPIEXEC=/opt/mpiexec-0.80

Support for this format was added to i2g-mpi-start as of version 0.0.46.

Optional

  • MPI_<flavour>_COMPILER
  • MPI_<flavour>_<version>_PATH
  • MPI_<flavour>_<version>_<compiler>_PATH
  • MPI_INTERCONNECT=<interconnect>

Configuration of the Information System

Sites may install different implementations (or flavours) of MPI. It is expected that users will converge on OpenMPI in the future, but for the moment a variety of libraries and tools are in use. It is therefore important that users can use the information system to locate sites with the software they require. You should publish some values to let the world know which flavour of MPI you support, as well as the interconnect and a few other things. Everything related to MPI should be published as GlueHostApplicationSoftwareRunTimeEnvironment in the corresponding sections.

Publishing MPI-START

If you support MPI-START publish it with:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START

Publishing your MPI Flavour

Publish which flavour and version of MPI you are using. Optionally, you can also specify the compiler and version. MPI flavours are MPICH, MPICH2 and OPENMPI. For example:

GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI

GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.5

GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.2.5-ICC

Publishing your MPI Interconnect

If you have a special interconnect (like Infiniband or Myrinet), you can publish it like this:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-Infiniband

Publish other features related to MPI

If you have a shared filesystem area between WNs, you can publish it with:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME

Yaim Configuration

A YAIM module for MPI configuration is available as part of the gLite distribution (glite-yaim-mpi). This module will perform automatic configuration for PBS/Torque schedulers, enabling the use of submit filters, definition of environment variables and configuration of the information system. If you have another batch system or prefer to do the configuration manually, you can find the instructions above.

Configure MPI in site-info.def

Individual "flavours" of MPI are enabled by setting the associated variable to "yes". For example, to enable Open MPI, add the following:

MPI_OPENMPI_ENABLE="yes"

You may set the path and version for a flavour of MPI as follows:

MPI_OPENMPI_PATH="/opt/i2g/openmpi/"
MPI_OPENMPI_VERSION="1.2.5"

The remaining variables are:

Variable                   Meaning
MPI_SHARED_HOME            Set this to "yes" if you have a shared home area between WNs.
MPI_SSH_HOST_BASED_AUTH    If you do NOT have SSH host-based authentication between your WNs, set this to "no".
MPI_MPICH_MPIEXEC          If you are using OSC mpiexec with MPICH, set this to the location of the mpiexec program, e.g. "/usr/bin/mpiexec".
MPI_MPI_START              Location of mpi-start if not installed in the standard location /opt/i2g/bin/mpi-start.

Here is an example configuration:

#----------------------------------
# MPI-related configuration:
#----------------------------------
# Several MPI implementations (or "flavours") are available.
# If you do NOT want a flavour to be installed/configured, set its variable
# to "no". Else, set it to "yes" (default). If you want to use an
# already installed version of an implementation, set its "_PATH" and
# "_VERSION" variables to match your setup (examples below).
#
# NOTE 1: the CE_RUNTIMEENV will be automatically updated in the file
# functions/config_mpi, so that the CE advertises the MPI implementations
# you choose here - you do NOT have to change it manually in this file.
# It will become something like this:
#
#   CE_RUNTIMEENV="$CE_RUNTIMEENV
#              MPI_MPICH
#              MPI_MPICH2
#              MPI_OPENMPI
#              MPI_LAM"
#
# NOTE 2: it is currently NOT possible to configure multiple concurrent
# versions of the same implementations (e.g. MPICH-1.2.3 and MPICH-1.2.7)
# using YAIM. Customize "/opt/glite/yaim/functions/config_mpi" file
# to do so.

MPI_OPENMPI_ENABLE="yes"
MPI_MPICH_ENABLE="yes"
MPI_MPICH2_ENABLE="yes"
MPI_LAM_ENABLE="yes"

#---
# Example for using an already installed version of MPI.
# Setting "_PATH" and "_VERSION" variables will prevent YAIM
# from downloading and installing the gLite-provided packages.
# Just fill in the path to its current installation (e.g. "/usr")
# and which version it is (e.g. "6.5.9").
#---
MPI_OPENMPI_PATH="/opt/i2g/openmpi/"
MPI_OPENMPI_VERSION="1.2.5"
MPI_MPICH_PATH="/opt/mpich-1.2.7p1/"
MPI_MPICH_VERSION="1.2.7p1"
MPI_MPICH2_PATH="/opt/mpich2-1.0.4/"
MPI_MPICH2_VERSION="1.0.4"
MPI_LAM_VERSION="7.1.2"

# If you do NOT provide a shared home, set $MPI_SHARED_HOME to "no" (default).
#
MPI_SHARED_HOME="yes"

#
# If you do NOT have SSH Hostbased Authentication between your WNs, set the below
# variable to "no" (default). Else, set it to "yes".
#
MPI_SSH_HOST_BASED_AUTH="no"

#
# If you provide an 'mpiexec' for MPICH or MPICH2, please state the full path to
# that file here (http://www.osc.edu/~pw/mpiexec/index.php). Else, leave empty.
#
#MPI_MPICH_MPIEXEC="/usr/bin/mpiexec"
MPI_MPICH_MPIEXEC="/usr/bin/mpiexec"

Configure CE

/opt/glite/yaim/bin/yaim -c -s site-info.def -n MPI_CE

Configure WN

For a Torque worker node:

/opt/glite/yaim/bin/yaim -c -s site-info.def -n MPI_WN -n glite-WN -n TORQUE_client

Testing configuration

You can do some basic tests by logging in on a WN as a pool user and running the following:

[dte056@cagnode48 dte056]$ env|grep MPI_

You should see something like this:

MPI_MPICC_OPTS=-m32
MPI_SSH_HOST_BASED_AUTH=yes
MPI_OPENMPI_PATH=/opt/openmpi/1.1
MPI_LAM_VERSION=7.1.2
MPI_MPICXX_OPTS=-m32
MPI_LAM_PATH=/usr
MPI_OPENMPI_VERSION=1.1
MPI_MPIF77_OPTS=-m32
MPI_MPICH_VERSION=1.2.7
MPI_MPIEXEC_PATH=/opt/mpiexec-0.80
MPI_MPICH2_PATH=/opt/mpich2-1.0.4
MPI_MPICH2_VERSION=1.0.4
I2G_MPI_START=/opt/i2g/bin/mpi-start
MPI_MPICH_PATH=/opt/mpich-1.2.7p1

You can also try submitting a job to your site using the instructions found below.

Submission of MPI Jobs

In order to invoke MPI-START you need a wrapper script that sets the environment variables that define your job. This script is generic and should not need to have significant modifications made to it.


#!/bin/bash
# Pull in the arguments.

MY_EXECUTABLE=`pwd`/$1

MPI_FLAVOR=$2
 
 # Convert flavor to lowercase for passing to mpi-start.
 MPI_FLAVOR_LOWER=`echo $MPI_FLAVOR | tr '[:upper:]' '[:lower:]'`
 
 # Pull out the correct paths for the requested flavor.
 eval MPI_PATH=`printenv MPI_${MPI_FLAVOR}_PATH`
 
 # Ensure the prefix is correctly set.  Don't rely on the defaults.
 eval I2G_${MPI_FLAVOR}_PREFIX=$MPI_PATH
 export I2G_${MPI_FLAVOR}_PREFIX
 
 # Touch the executable.  It must exist for the shared file system check.
 # If it does not, then mpi-start may try to distribute the executable
 # when it shouldn't.
 touch $MY_EXECUTABLE
 chmod +x $MY_EXECUTABLE
 
 # Setup for mpi-start.
 export I2G_MPI_APPLICATION=$MY_EXECUTABLE
 export I2G_MPI_APPLICATION_ARGS=
 export I2G_MPI_TYPE=$MPI_FLAVOR_LOWER
 # optional hooks
 #export I2G_MPI_PRE_RUN_HOOK=mpi-hooks.sh
 #export I2G_MPI_POST_RUN_HOOK=mpi-hooks.sh
 
 # If these are set then you will get more debugging information.
 #export I2G_MPI_START_VERBOSE=1
 #export I2G_MPI_START_DEBUG=1
 
 # Invoke mpi-start.
 $I2G_MPI_START
 

In your JDL file you should set the JobType to "Normal" and set NodeNumber to the number of desired nodes. The Executable should be your wrapper script for MPI-START (mpi-start-wrapper.sh in this case) and the Arguments are your MPI binary and the MPI flavour it uses. MPI-START allows user-defined extensions via hooks; check the MPI-START Hook CookBook for examples. Here is an example JDL for the submission of the cpi application using 10 processes:

 JobType = "Normal";
 VirtualOrganisation = "dteam";
 NodeNumber = 10;
 Executable = "mpi-start-wrapper.sh";
 Arguments = "cpi OPENMPI";
 StdOutput = "cpi.out";
 StdError = "cpi.err";
 InputSandbox = {"cpi", "mpi-start-wrapper.sh"};
 OutputSandbox = {"cpi.out", "cpi.err"};
 Requirements = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && Member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && Member("OPENMPI-1.2.5",  other.GlueHostApplicationSoftwareRunTimeEnvironment);

Please note that the NodeNumber variable refers to the number of CPUs you require. The new EGEE MPI WG is discussing how to implement a fine-grained selection of nodes and/or CPUs (i.e. to specify the number of processors per node and the number of nodes, not only the total number of CPUs).

Sites without MPI-START

If you want to get the convenience of mpi-start usage even at sites which have not yet installed it, you can submit a tarball (e.g. mpi-start-0.0.58.tar.gz) of mpi-start along with your job (in the input sandbox) and add the following lines at the start of your wrapper script to set it up:

if [ "x$I2G_MPI_START" = "x" ]; then
    # untar mpi-start and set up variables
    tar xzf mpi-start-*.tar.gz
    export I2G_MPI_START=bin/mpi-start
    MPIRUN=`which mpirun`
    export MPI_MPICH_PATH=`dirname $MPIRUN`
fi
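Remember to ship the tarball in the job's input sandbox alongside your wrapper script and binary. As a quick local check on the UI (assuming the tarball is in the current directory), confirm that it really contains the bin/mpi-start path the snippet above relies on:

 tar tzf mpi-start-*.tar.gz | grep 'bin/mpi-start$'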

Known issues and ongoing work

MPI job support is a necessity for many application areas; however, the configuration is highly dependent on the cluster architecture. The design of *mpi-start* focused on making MPI job submission as transparent as possible with respect to the cluster details.

However, there are still issues to be addressed; please feel free to send an e-mail to extend this list:

* Selection of the proper core/CPU combination: fine-grained selection of the nodes and/or cores per CPU. This affects the efficiency of the code, as the scaling properties of MPI codes depend heavily on it.

* Identification of the interconnect technology in the Information System: an unambiguous name is needed to identify, via the information system, the available cluster interconnects to be used in GlueHostApplicationSoftwareRunTimeEnvironment.

* Resource reservation: reservation of resources for MPI jobs. MPI jobs should not share a node with a serial job. For MPI applications that use the interconnect intensively, sharing the node with a serial job or with another MPI job is not an option.
