Deploying Sun Grid Engine in a LCG Computing Element
Disclaimer
This software is considered beta -- you use it at your own risk. It may not be fully optimized or correct and should therefore be considered experimental. There is no guarantee that it is compatible with the way your site is configured.
About
Author: Gonçalo Borges,
goncalo@lipNOSPAMPLEASE.pt
Version: 0.0.0-2
Abstract: SGE Yaim integration Manual for lcg-CE and glite-WN
RPMS Description:
gliteWN-yaimtosge-0.0.0-2.i386.rpm: Modification to standard glite yaim tool for glite-WN integration using SGE as scheduler system. It will install:
{{{
/etc/profile.d/sge.sh (csh): To set the proper environment;
/opt/glite/yaim/scripts/configure_sgeclient.pm: SGE installation directories;
/opt/glite/yaim/scripts/nodesge-info.def: SGE nodes functions definition;
/opt/glite/yaim/functions/config_sge_client: Configures SGE exec host;
}}}
lcgCE-yaimtosge-0.0.0-2.i386.rpm: Modification to standard glite yaim tool for lcg-CE integration using SGE as scheduler system. It will install:
{{{
/etc/profile.d/sge.sh (csh): To set the proper environment;
/opt/glite/yaim/scripts/configure_sgeserver.pm: SGE installation directories;
/opt/glite/yaim/scripts/nodesge-info.def: SGE nodes functions definition;
/opt/glite/yaim/functions/config_sge_server: Configures SGE QMASTER
/opt/globus/lib/perl/Globus/GRAM/JobManager/lcgsge.pm: The SGE jobmanager;
/opt/lcg/libexec/lcg-info-dynamic-sge: The SGE CE GRIS/GIIS perl script.
}}}
sge-V60u7_1-3.i386.rpm: Contains the binaries and libraries needed to run sge commands;
sge-utils-V60u7_1-3.i386.rpm: Installation scripts and SGE utilities;
sge-daemons-V60u7_1-3.i386.rpm: The SGE daemons;
sge-ckpt-V60u7_1-3.i386.rpm: For checkpointing purposes;
sge-parallel-V60u7_1-3.i386.rpm: For running parallel environments, such as OpenMPI, MPICH, etc;
sge-docs-V60u7_1-3.i386.rpm: Documentation, manuals and examples;
sge-qmon-V60u7_1-3.i386.rpm: The SGE graphical interface (qmon);
RPMS Download:
http://www.lip.pt/grid/gliteWN-yaimtosge-0.0.0-2.i386.rpm
http://www.lip.pt/grid/lcgCE-yaimtosge-0.0.0-2.i386.rpm
http://www.lip.pt/grid/sge-V60u7_1-3.i386.rpm
http://www.lip.pt/grid/sge-utils-V60u7_1-3.i386.rpm
http://www.lip.pt/grid/sge-daemons-V60u7_1-3.i386.rpm
http://www.lip.pt/grid/sge-ckpt-V60u7_1-3.i386.rpm
http://www.lip.pt/grid/sge-parallel-V60u7_1-3.i386.rpm
http://www.lip.pt/grid/sge-docs-V60u7_1-3.i386.rpm
http://www.lip.pt/grid/sge-qmon-V60u7_1-3.i386.rpm
Pre-Requisites:
The SGE rpm packages delivered with this manual were built under SLC4, with the additional packaging of the libdb-4.2.so library so that they also work in SLC3. Please report problems to
goncalo@lipNOSPAMPLEASE.pt.
We will assume that the standard “lcg-CE” and “glite-WN” software is already installed (but not configured) on the proper machines. The installation should have been performed following the instructions proposed in:
http://grid-deployment.web.cern.ch/grid-deployment/documentation/LCG2-Manual-Install/
https://twiki.cern.ch/twiki/bin/view/EGEE/CertTestBedWorld
Check that your apt repositories are properly set to:
{{{
[root@ce03 root]# cat /etc/apt/sources.list.d/lcg-ca.list
rpm http://linuxsoft.cern.ch/ LCG-CAs/current production

[root@ce03 root]# cat /etc/apt/sources.list.d/lcg.list
rpm http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/ rhel30 externals Release3.0 updates

[root@ce03 root]# cat /etc/apt/sources.list.d/cern.list
rpm http://linuxsoft.cern.ch cern/slc30X/i386/apt os updates extras
rpm-src http://linuxsoft.cern.ch cern/slc30X/i386/apt os updates extras
}}}
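If the lists above are in place, a quick sanity check is to refresh the package indexes; apt should reach all three repositories without errors:
{{{
[root@ce03 root]# apt-get update
}}}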
Stop following the LCG manual and start following this one right before you reach the Middleware Configuration section.
Please ensure that “passwordless ssh” works from a WN pool account to a CE pool account. This requirement is not specific to this deployment; it is needed by all grid infrastructures.
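A quick way to verify this (the pool account name is a hypothetical example; use one of your own pool accounts and your real hostnames):
{{{
# run on a WN as a pool account; it should print the CE hostname without asking for a password
[dteam001@sgewn01 ~]$ ssh ce03.lip.pt hostname
}}}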
CE gatekeeper Installation
Sun Grid Engine needs a Qmaster machine which, in this manual, we assume will be installed on the CE gatekeeper. The SGE rpms deploy all files under /usr/local/sge/V60u7_1 and link that directory to /usr/local/sge/pro. Later on, $SGE_ROOT will be defined as /usr/local/sge/pro so that old SGE versions can be kept and used when needed. Please install the following SGE packages (they require the “openmotif (>= 2.2.3-5)” package which, if not already there, you may find in the SLC repositories):
{{{
sge-V60u7_1-3.i386.rpm
sge-utils-V60u7_1-3.i386.rpm
sge-daemons-V60u7_1-3.i386.rpm
sge-qmon-V60u7_1-3.i386.rpm
sge-ckpt-V60u7_1-3.i386.rpm
sge-parallel-V60u7_1-3.i386.rpm
sge-docs-V60u7_1-3.i386.rpm
}}}
{{{
[root@ ~]# rpm -ivh sge-V60u7_1-3.i386.rpm sge-utils-V60u7_1-3.i386.rpm sge-daemons-V60u7_1-3.i386.rpm sge-qmon-V60u7_1-3.i386.rpm sge-ckpt-V60u7_1-3.i386.rpm sge-parallel-V60u7_1-3.i386.rpm sge-docs-V60u7_1-3.i386.rpm
Preparing... ########################################### [100%]
   1:sge                    ########################################### [ 14%]
   2:sge-utils              ########################################### [ 29%]
   3:sge-daemons            ########################################### [ 43%]
   4:sge-qmon               ########################################### [ 57%]
   5:sge-ckpt               ########################################### [ 71%]
   6:sge-parallel           ########################################### [ 86%]
   7:sge-docs               ########################################### [100%]
}}}
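As a quick sanity check of the layout described above, you can confirm that the version directory exists and that /usr/local/sge/pro points to it (a sketch; the link is created by the rpms):
{{{
[root@ ~]# ls -ld /usr/local/sge/V60u7_1
[root@ ~]# ls -l /usr/local/sge/pro     # should be a symlink to /usr/local/sge/V60u7_1
}}}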
* Install lcgCE-yaimtosge-0.0.0-2.i386.rpm, which includes the modifications to the standard yaim tool allowing the SGE scheduler configuration. This rpm requires the “perl-XML-Simple >= 2.14-2.2” package, which you can download from http://rpmfind.net/linux/rpm2html/search.php?query=perl-XML-Simple. It also requires glite-yaim >= 3.0.0-34.
(!) Please upgrade your yaim version to the latest release.
{{{
[root@ ~]# rpm -ivh lcgCE-yaimtosge-0.0.0-2.i386.rpm
Preparing... ########################################### [100%]
   1:lcgCE-yaimtosge        ########################################### [100%]
}}}
* Add the following values to your site-info.def file:
{{{
SGE_QMASTER=$CE_HOST
DEFAULT_DOMAIN=$MY_DOMAIN
ADMIN_MAIL=
}}}
* Check that the “WN_LIST”, “USERS_CONF”, “VOS” and "QUEUES" variables are also properly defined in your site-info.def file. The content of these variables will be used to build the SGE exec node list, the SGE user sets and the SGE local queues. For the time being, the VO users in the USERS_CONF file have to be defined in the same order as the QUEUES definition; otherwise, the VO SGE userset will not correspond to the correct VO queue. This will be fixed in the future...
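For illustration, a minimal sketch of these variables (hypothetical VO names, queue names and paths; adapt them to your site, keeping the VO order in USERS_CONF aligned with QUEUES):
{{{
WN_LIST=/opt/glite/yaim/etc/wn-list.conf
USERS_CONF=/opt/glite/yaim/etc/users.conf
VOS="dteam ops"
QUEUES="dteamgrid opsgrid"
}}}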
* Configure the CE running SGE using the “CE_sge” node definition.
{{{
[root@ ~]#/opt/glite/yaim/scripts/configure_node <path_to_your_site-info.def_file> CE_sge BDII_site
}}}
* The CE configuration must always be run before the WN configurations; otherwise the SGE daemons on the WNs will not start, since there is no Qmaster host associated with them.
* SGE prompt commands will be accessible after a new login (which sources the /etc/profile.d/ scripts).
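After that new login, a few standard SGE commands can be used to check the Qmaster configuration (a sketch; the output depends on your site-info.def values):
{{{
[root@ ~]# qconf -sh     # administrative hosts known to the Qmaster
[root@ ~]# qconf -sql    # configured cluster queues (should match your QUEUES variable)
[root@ ~]# qconf -sel    # execution hosts (populated once the WNs are configured)
[root@ ~]# qstat -f      # full queue status
}}}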
* To start the SGE GUI with the “qmon” command, you need to install “xorg-x11-xauth >= 6.8.2-1”. Unfortunately, this package is not available in the SLC3 repository, so you have to download it from the SLC4 one: http://linuxsoft.cern.ch/cern/slc4X/i386/SL/RPMS/xorg-x11-xauth-6.8.2-1.EL.13.37.i386.rpm
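For example, something like the following (a sketch using the URL above):
{{{
[root@ ~]# wget http://linuxsoft.cern.ch/cern/slc4X/i386/SL/RPMS/xorg-x11-xauth-6.8.2-1.EL.13.37.i386.rpm
[root@ ~]# rpm -ivh xorg-x11-xauth-6.8.2-1.EL.13.37.i386.rpm
}}}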
If you have configured your CE with wrong values for the “WN_LIST”, “USERS_CONF”, “VOS” and "QUEUES" variables, an easy way to recover is to delete the /usr/local/sge/pro/default directory and run the CE configuration again, as sketched below.
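A minimal sketch of that recovery, reusing the configure_node invocation shown above:
{{{
# remove the SGE cell created by the previous (mis)configuration, then reconfigure
[root@ ~]# rm -rf /usr/local/sge/pro/default
[root@ ~]# /opt/glite/yaim/scripts/configure_node <path_to_your_site-info.def_file> CE_sge BDII_site
}}}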
WN Installation
Please install the following sge packages:
{{{
sge-V60u7_1-3.i386.rpm
sge-utils-V60u7_1-3.i386.rpm
sge-daemons-V60u7_1-3.i386.rpm
sge-parallel-V60u7_1-3.i386.rpm
sge-docs-V60u7_1-3.i386.rpm
}}}
{{{
[root@ ~]# rpm -ivh sge-V60u7_1-3.i386.rpm sge-utils-V60u7_1-3.i386.rpm sge-daemons-V60u7_1-3.i386.rpm sge-parallel-V60u7_1-3.i386.rpm sge-docs-V60u7_1-3.i386.rpm
Preparing... ########################################### [100%]
   1:sge                    ########################################### [ 20%]
   2:sge-utils              ########################################### [ 40%]
   3:sge-daemons            ########################################### [ 60%]
   4:sge-parallel           ########################################### [ 80%]
   5:sge-docs               ########################################### [100%]
}}}
* Install gliteWN-yaimtosge-0.0.0-2.i386.rpm, which includes the modifications to the standard yaim tool allowing the SGE client configuration.
{{{
[root@ ~]# rpm -ivh gliteWN-yaimtosge-0.0.0-2.i386.rpm
Preparing... ########################################### [100%]
   1:gliteWN-yaimtosge      ########################################### [100%]
}}}
* Use the same site-info.def file as in the gatekeeper case. This file should already include definitions for the “SGE_QMASTER”, “DEFAULT_DOMAIN” and “ADMIN_MAIL” variables.
* Configure the WN using the “WN_sge” node definition.
{{{
[root@ ~]# /opt/glite/yaim/scripts/configure_node <path_to_your_site-info.def_file> WN_sge
}}}
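To confirm that the WN configuration worked, you can check that the execution daemon is running on the WN and that the node registered with the Qmaster (a sketch; the hostnames are the examples used elsewhere in this manual):
{{{
# on the WN: the SGE execution daemon should be running
[root@sgewn01 ~]# ps -ef | grep sge_execd | grep -v grep

# on the CE/Qmaster: the WN should now appear in the execution host list
[root@ce03 ~]# qconf -sel
}}}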
Testing:
Test the information system using the following commands:
{{{
ldapsearch -x -h -p 2135 -b "mds-vo-name=local,o=grid"
ldapsearch -x -h -p 2170 -b "mds-vo-name=,o=grid"
}}}
* Check if it is returning the proper queue names and available resources.
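For example, to list only the published CE/queue endpoints (a sketch; the host shown is the CE used in the examples below, replace it with yours):
{{{
[goncalo@ui01]$ ldapsearch -x -h ce03.lip.pt -p 2135 -b "mds-vo-name=local,o=grid" | grep GlueCEUniqueID
}}}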
* Try to submit a simple script from a given pool account on your CE. This test checks whether the SGE prompt commands (like qsub or qstat) are working. If the job finishes successfully, the stdout and stderr files won't be available on your CE since, in a normal grid event, they would be transferred directly from the WN to the RB using GSIFTP. Check the stderr/stdout files on the WN instead.
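A sketch of such a local submission (“dteam001” is a hypothetical pool account name):
{{{
[dteam001@ce03 ~]$ cat > hello.sh << 'EOF'
#!/bin/sh
hostname
date
EOF
[dteam001@ce03 ~]$ qsub hello.sh
[dteam001@ce03 ~]$ qstat     # follow the job until it leaves the queue
}}}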
* Try to do a globus-job-run using fork from a UI (you have to start your proxy first):
{{{
[goncalo@ui01]$ globus-job-run ce03.lip.pt:2119/jobmanager-fork /bin/uname -a
Linux ce03.lip.pt 2.6.9-34.EL.cern #1 Sun Mar 12 12:19:53 CET 2006 i686 athlon i386 GNU/Linux
}}}
* Try to do a globus-job-run using lcgsge from a UI:
{{{
[goncalo@ui01]$ globus-job-run ce03.lip.pt:2119/jobmanager-lcgsge /bin/uname -a
Linux sgewn01.lip.pt 2.6.9-34.EL.cern #1 Sun Mar 12 12:19:53 CET 2006 i686 i686 i386 GNU/Linux
}}}
* Try to submit a job through the RB from a UI:
{{{
[goncalo@ui01]$ edg-job-submit -r ce03.lip.pt:2119/jobmanager-lcgsge-dteamgrid well.jdl
Selected Virtual Organisation name (from proxy certificate extension): dteam
Connecting to host rb02.lip.pt, port 7772
Logging to host rb02.lip.pt, port 9002
*******************************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://rb02.lip.pt:9000/Ab0W2EpWMPkpJKjAMpRCsQ
*******************************************************************************************
[goncalo@ui01 ce02]$ edg-job-status https://rb02.lip.pt:9000/Ab0W2EpWMPkpJKjAMpRCsQ
***********************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://rb02.lip.pt:9000/Ab0W2EpWMPkpJKjAMpRCsQ
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce03.lip.pt:2119/jobmanager-lcgsge-dteamgrid
reached on: Fri Feb 2 18:42:38 2007
***********************************************************
[goncalo@ui01 ce02]$ cat /tmp/jobOutput/goncalo_Ab0W2EpWMPkpJKjAMpRCsQ/well.out
One Perl out of the sea!
This is Linux sgewn01.lip.pt 2.6.9-34.EL.cern #1 Sun Mar 12 12:19:53 CET 2006 i686 i686 i386 GNU/Linux
Fri Feb 2 18:31:14 WET 2007
}}}