Batch System Support and Coordination
Batch System Support is a community-driven effort by those who require CE adaptation for a specific batch system. The aim of this page is to be a focal point for these activities. It links to information for system administrators trying to grid-enable their farm, and also contains a general "How To" for adapting the CE to a specific batch system.
Supported CEs
There is support for
LCG-CE and
CREAM. There is a transition plan to replace LCG-CE with CREAM, which includes certain acceptance criteria related to batch system integration. See the
status page of batch system support on CREAM.
Notes on parameter passing
See the plans for
ParameterPassing to the batch system.
Torque batch system
Torque integration in general is maintained by NIKHEF within SA3.
Torque integration with blah is maintained by INFN within SA3.
Condor batch system
The e-mail list of the Condor batch system group is
project-eu-egee-batchsystem-condor@cern.ch
You can subscribe yourself through the SIMBA interface (http://simba.cern.ch).
Condor integration is maintained by IFAE (PIC) within SA3.
lcg-CE or creamCE
Installation instructions for condor
Queue simulation
Queue simulation instructions for Condor
SGE batch system
Check
Current status of the implementation of SGE wiki page.
SGE integration is maintained by CESGA within SA3.
LSF batch system
Workplan for LSF batch system testing at PIC:
- Research the possibility of running the LSF server on a virtual machine
- Installation & configuration of server and clients
- Possible enhancements of BLAH for LSF
- Accounting verification (APEL/DGAS)
- Information system:
- testing scheduler scripts
- possibility of running different clusters for just one computing element (in particular SLC4 testing)
- Stress test with more than a few nodes
- SAM test script implementation
Support for blah with LSF is maintained by INFN within SA3.
Testing
Details of community-based testing should be put here.
Information for LRMS integrators
Information on how to integrate your LRMS for CREAM will appear in the
CreamLRMSCookBook.
To add a batch system to the glite release check the following points:
- Nodetypes supporting the batch system
- lcg-CE SL3, SL4 (when available)
- CREAM SL4 (as soon as it is available, you may leverage work from glite-CE as both use BLAH)
- Jobmanager on lcg-CE
- BLAH plugin for CREAM
- Information Provider
- Accounting
- APEL on lcg-CE
- APEL on CREAM (take glite-CE work as BLAH is the same)
RPMs are needed for Jobmanager, BLAH plugin, Information Provider and APEL (specific part for the batch system).
Information providers
For each batch system, there should be a backend command or set of backend commands that produce a representation of the queue state in a prescribed format. This output is taken by the lcg-info-dynamic-scheduler to calculate, among other things, the
EstimatedResponseTime.
In the current setup there need to be two (possibly three) scripts:
- lcg-info-dynamic-provider-{pbs,lsf,sge,condor...}
- lrmsinfo-{pbs,lsf,sge,condor,...} : this script is called by the lcg-info-dynamic-scheduler
- vomaxjobs-{maui,lsf,sge,condor,...} : this optional script is called by the lcg-info-dynamic-scheduler
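To make the idea of a backend command concrete, here is a minimal sketch of an lrmsinfo-style script. The key names and values are placeholders only; the real prescribed output format is defined by the lcg-info-dynamic-scheduler, and an actual backend would obtain its numbers by parsing the batch system's own status command (e.g. qstat for Torque/PBS).

```shell
# Illustrative lrmsinfo-style backend (hypothetical output format).
# A real implementation would parse the batch system's status command
# instead of using the fixed placeholder values below.
nactive=12   # placeholder: number of running jobs
nfree=4      # placeholder: number of free job slots
echo "nactive $nactive"
echo "nfree $nfree"
echo "now $(date +%s)"
```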
In this arrangement the GIP invokes the lcg-info-dynamic-scheduler, which in turn calls the lrmsinfo and (optionally) vomaxjobs scripts.
However, in the near future the lcg-info-dynamic-scheduler will incorporate the functionality of the lcg-info-dynamic-* scripts, so that the GIP needs to call only the lcg-info-dynamic-scheduler itself.
This transition will not be a 'big-bang' upgrade but will be phased, e.g, for the batch system 'pbs':
- a new version of the 'lcg-info-dynamic-scheduler' will be rolled out, with a flag 'use_old_style_output' set. Test until satisfied;
- new versions of the 'lrmsinfo-pbs' and 'vomaxjobs-maui' scripts will be rolled out, but they will still produce old-style output (using a configuration setting). Again, test until satisfied;
- the 'lrmsinfo-pbs' and 'vomaxjobs-maui' scripts will be configured to produce 'new style' output (Protocol_V2). Again, test until satisfied;
- as the 'lcg-info-dynamic-scheduler' script still has its configuration setting 'use_old_style_output' set, the GIP will not see anything different;
- the 'lcg-info-dynamic-pbs' script is stopped;
- the 'use_old_style_output' flag is set to 'false' in the 'lcg-info-dynamic-scheduler' script and the GIP now receives all information from only the 'lcg-info-dynamic-scheduler' script. Do a final test to verify that the GIP is still happy.
Thus a phased upgrade can be done for each batch system.
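The switching mechanism behind the phased upgrade can be sketched as follows. The flag name 'use_old_style_output' comes from the plan above, but both output formats shown are invented placeholders; the real formats are the existing old-style output and Protocol_V2.

```shell
# Sketch of flag-driven output selection during the phased upgrade.
# Both formats below are placeholders, not the real protocols.
render_state() {
    if [ "$use_old_style_output" = "true" ]; then
        echo "nactive $1"            # stand-in for the old-style output
    else
        echo "protocol_version 2"    # stand-in for a Protocol_V2 header
        echo "nactive=$1"
    fi
}
use_old_style_output=true
old_out=$(render_state 12)
use_old_style_output=false
new_out=$(render_state 12)
```

Flipping the single flag selects the format, which is what allows each script in the chain to be upgraded and tested independently.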
Configuration
YAIM should provide configuration for the Jobmanager, BLAH plugin, Information Provider and APEL, but not necessarily for the batch system itself.
For meta-rpms and configuration targets, please follow the model adopted for Torque:
- glite-TORQUE_server - what you need to install on your HEAD node
- glite-TORQUE_client - what you need to install on your WN
- glite-TORQUE_utils - what you need to install on your CE or BDII_site (this will include submitter scripts, info providers, accounting, etc.).
Anticipated installation scenarios:
- CE with own torque server - CE + TORQUE_server + TORQUE_utils
- CE with separate torque server - CE + TORQUE_utils
- Standalone TORQUE server - TORQUE_server + TORQUE_utils
- WN for torque - glite-WN + TORQUE_client
- BDII_site - glite-BDII + TORQUE_utils
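The scenario table above can be expressed as a small helper that returns the meta-packages to install for each case. The TORQUE package names are taken from the model above; the CE and WN package names (glite-CE, glite-WN, glite-BDII) and the helper function itself are illustrative assumptions, not part of YAIM.

```shell
# Map each anticipated installation scenario to its meta-packages.
# Package names other than the glite-TORQUE_* ones are assumptions.
packages_for_scenario() {
    case "$1" in
        ce_with_own_torque) echo "glite-CE glite-TORQUE_server glite-TORQUE_utils" ;;
        ce_separate_torque) echo "glite-CE glite-TORQUE_utils" ;;
        standalone_torque)  echo "glite-TORQUE_server glite-TORQUE_utils" ;;
        wn)                 echo "glite-WN glite-TORQUE_client" ;;
        bdii_site)          echo "glite-BDII glite-TORQUE_utils" ;;
        *)                  return 1 ;;
    esac
}
```

For example, `packages_for_scenario wn` prints `glite-WN glite-TORQUE_client`.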
From this you can see that the <BATCH>_utils configuration target will have to detect the node-type in order to know what to configure.
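One plausible way for a <BATCH>_utils configuration target to detect the node type is to query the rpm database for the node-type meta-package, sketched below. Both the mechanism and the package names queried are assumptions for illustration, not a fixed YAIM convention.

```shell
# Hypothetical node-type detection for a <BATCH>_utils target.
# Queries the rpm database; package names are assumptions.
detect_node_type() {
    if rpm -q glite-CE >/dev/null 2>&1; then
        echo "CE"
    elif rpm -q glite-BDII >/dev/null 2>&1; then
        echo "BDII_site"
    else
        echo "unknown"
    fi
}
```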
For other batch systems, gLite will not distribute the batch system software itself, so you would expect (for example) configuration targets such as:
- SGE_server
- SGE_clients
- SGE_utils
These could be implemented by glite-yaim-sge, or a separate YAIM rpm could be produced in each case.
meta-packages