Preliminary steps
To use the Grid resources you need a certificate and to join a virtual organisation (VO). The certificate guarantees your identity; the association to a VO gives you access to the corresponding resources (computer centres accepting jobs of a given VO, etc.).
Each future user has to apply for a certificate. It is a five-minute procedure and the instructions are under:
http://lcg.web.cern.ch/LCG/digital.htm
Please take the time to read the page so that you apply to the correct certification authority (it depends on your affiliation).
The second step is joining the VO. If you have been told to join the GEAR VO (often used for the initial training/application porting), do the following:
- Upload the certificate into your browser (part of the previous step).
- Visit https://lcg-voms.cern.ch:8443/vo/vo.gear.cern.ch/vomrs
and follow the registration Phases I and II. It is very important that you fully complete Phase II: after Phase I you will receive an e-mail leading you to Phase II.
- Import the certificate on lxplus (here we assume lxplus is your (initial) submission host). Instructions (linked from the previous pages) are under
https://ca.cern.ch/ca/Help/?kbid=024010
Try out your certificate
Normally one would do:
bash
. /afs/cern.ch/project/gd/LCG-share/sl5/etc/profile.d/grid-env.sh
voms-proxy-init --voms vo.gear.cern.ch
voms-proxy-info
but you can let Ganga do all of this for you.
Fire up Ganga
The main tool for porting and supporting applications on the Grid is Ganga (
http://cern.ch/ganga
). A very complete user doc page is available under
http://ganga.web.cern.ch/ganga/user/index.php
(also reachable from the Ganga home page). We suggest starting from this page because Ganga is a natural bridge from your familiar batch system to the Grid infrastructure.
To learn more about the internals of the Grid middleware used in LCG, a very good document (maintained by the LCG project) is available under:
https://edms.cern.ch/file/722398/1.2/gLite-3-UserGuide.pdf
This section is from the Ganga pages:
http://ganga.web.cern.ch/ganga/user/installation/other.php
. Essentially you just have to run Ganga (there is no need to install it when working from lxplus).
First of all, log in to lxplus.
Get this file and save it as ~/.gangarc:
.gangarc example
% bash
% /afs/cern.ch/sw/ganga/install/5.5.3/bin/ganga
First examples
First helloWorld example (see the Ganga user guide):
j0 = Job()
j0.application = Executable(exe=File('/bin/echo'), args=['HelloWorld'])
j0.submit()
This will execute locally on your machine (the default Local backend).
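Once the job has finished, you can inspect the returned output from the Ganga prompt. A minimal check, assuming the job completed successfully (it uses only the jobs registry and the outputdir attribute shown later on this page):
jobs                        # summary table of all your jobs and their status
print j0.status             # should eventually become 'completed'
!cat $j0.outputdir/stdout   # the echoed 'HelloWorld' is returned in stdout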
If you want to run it on
LSF, please do:
j1 = j0.copy() # a copy of j0 that can be submitted (you cannot re-submit j0 since it has already run)
j1.backend='LSF'
j1.submit()
Then on the Grid (
LCG)...
j2 = j0.copy() # a copy of j0 that can be submitted (you cannot re-submit j0 since it has already run)
j2.backend='LCG'
j2.submit()
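Once the LCG job has finished, you can check where it actually ran and how it exited, using the same backend attributes that appear in the monitoring table and in the merge example further down:
print j2.status
print j2.backend.actualCE   # the computing element that executed the job
print j2.backend.exitcode   # 0 for a successful run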
More on Ganga
More on files (executable, input, output)
A slightly more complex example. To run it, you need to have gangaHello.py in your home directory: the job executes it and copies files into the job's working directory. The files declared in the outputsandbox will be returned to the user in case of successful execution.
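The content of gangaHello.py is not shown on this page; the following is only a hypothetical sketch of what such a script could look like (it greets using its first argument and creates the three files declared in the outputsandbox; remember to make it executable with chmod +x):
#!/usr/bin/env python
# Hypothetical sketch of gangaHello.py (not the actual file used on this page).
import sys
import shutil
arg = sys.argv[1] if len(sys.argv) > 1 else 'World!'
print 'Hello ' + arg                  # goes to stdout, returned automatically
shutil.copy('input1', 'output1')      # input1 is shipped via the inputsandbox
for name in ('output2', 'output3'):
    f = open(name, 'w')
    f.write('Hello ' + arg + '\n')
    f.close()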
import os
os.system('cp /etc/fstab ~/input1') # prepare an input file
jf=j0.copy()
jf.application = Executable(exe=File('~/gangaHello.py'),args=['World!'])
jf.inputsandbox = [File('~/input1')]
jf.outputsandbox=['output1','output2','output3']
jf.submit()
!ls -l $jf.outputdir # once the job has completed, list the files returned in the outputsandbox
Example with a splitter (same executable, multiple inputs)
mySplitter = ArgSplitter(args=[[str(x)] for x in range(10)])
full_print(mySplitter)
### Typical output:
### ArgSplitter (
### args = [['0'], ['1'], ['2'], ['3'], ['4'], ['5'], ['6'], ['7'], ['8'], ['9']]
### )
j3=j2.copy()
# j3 will be one job with 10 subjobs. A single submission generates 10 (sub)jobs each one with input parameter 0,1,2,...9
j3.splitter=mySplitter
j3.submit()
After a little while you should see something like this (here the "mother" job is number 16 and its subjobs are 16.0, 16.1, ..., 16.9):
fqid | status | name | subjobs | application | backend | backend.actualCE
-----------------------------------------------------------------------------------------------------------------------------
16.0 | running | | | Executable | LCG | ce.cyf-kr.edu.pl:2119/jobmanager-pbs-gear
16.1 | running | | | Executable | LCG | ce.cyf-kr.edu.pl:2119/jobmanager-pbs-gear
16.2 | submitted | | | Executable | LCG | gazon.nikhef.nl:2119/jobmanager-pbs-ekster
16.3 | submitted | | | Executable | LCG | trekker.nikhef.nl:2119/jobmanager-pbs-ekster
16.4 | running | | | Executable | LCG | ce.cyf-kr.edu.pl:2119/jobmanager-pbs-gear
16.5 | submitted | | | Executable | LCG | gazon.nikhef.nl:2119/jobmanager-pbs-ekster
16.6 | running | | | Executable | LCG |ce01.lcg.cscs.ch:2119/jobmanager-lcgpbs-other
16.7 | submitted | | | Executable | LCG | gazon.nikhef.nl:2119/jobmanager-pbs-medium
16.8 | submitted | | | Executable | LCG | gazon.nikhef.nl:2119/jobmanager-pbs-medium
16.9 | submitted | | | Executable | LCG |ce11.lcg.cscs.ch:2119/jobmanager-lcgpbs-other
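You can display this table at any time from the Ganga prompt:
jobs                 # summary table of all jobs
jobs[-1].subjobs     # the subjob table of the most recent job, as shown above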
To find out (after job completion) what happened to subjob 16.0, just do (from the Ganga prompt):
outdir = jobs[-1].subjobs[0].outputdir
!ls -l $outdir
The next block of code automatically merges the stdout files of the subjobs of the last job. Note that from the Ganga prompt you can execute a script by entering execfile('myfile.py').
import os
lastJob = jobs[-1]
nj = len(lastJob.subjobs)   # total number of subjobs
cj = nj                     # counts down for every successfully completed subjob
cmd = 'cat '
for sj in lastJob.subjobs:
    if sj.status == 'completed':
        if sj.backend.exitcode == 0:
            cj -= 1
            cmd += sj.outputdir + 'stdout' + ' '
if cj == 0:
    print 'Job completed: merge possible'
    print cmd
    os.system(cmd)          # concatenate the stdout of all subjobs
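As an alternative to merging by hand, Ganga 5 also provides merger objects that can be attached to a job before submission. A brief sketch, assuming the TextMerger object is available in your Ganga version (the merged file should then end up in the outputdir of the master job):
j4 = j3.copy()
j4.merger = TextMerger(files=['stdout'])   # merge the stdout of all subjobs on completion
j4.submit()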
One more reason to use subjobs...
Realistic tasks consist of several jobs. In most cases all jobs (or at least a given percentage) should execute successfully. Some jobs may fail, and the bookkeeping of resubmissions (keeping track of which job "replaces" its failed counterpart) is a nightmare. In the following example (for illustration, obtained by artificially allowing sites with an old version of the Python interpreter to be selected) some subjobs failed:
100.91 | failed | | | Executable | LCG |dc2-grid-65.brunel.ac.uk:2119/jobmanager-lcgp
100.93 | failed | | | Executable | LCG |dgc-grid-44.brunel.ac.uk:2119/jobmanager-lcgp
100.95 | failed | | | Executable | LCG | ce.cyf-kr.edu.pl:2119/jobmanager-pbs-gear
100.99 | failed | | | Executable | LCG | ce.cyf-kr.edu.pl:2119/jobmanager-pbs-gear
Ganga allows you to resubmit the failed ones by "recycling" their subjob numbers (here we assume the job is the last one in jobs[]):
jobs[-1].subjobs.select(status="failed").resubmit()
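A small sketch combining the pieces above: it counts how many subjobs of the last job have completed and resubmits the failed ones (it only uses the subjobs, status, select and resubmit constructs already shown on this page):
last = jobs[-1]
ntot = len(last.subjobs)
ndone = 0
nfail = 0
for sj in last.subjobs:
    if sj.status == 'completed':
        ndone += 1
    elif sj.status == 'failed':
        nfail += 1
print '%d/%d subjobs completed, %d failed' % (ndone, ntot, nfail)
if nfail > 0:
    last.subjobs.select(status='failed').resubmit()   # recycle the failed subjob numbers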
Monitoring my jobs... (extremely preliminary - complete new version)
Resources?
With a valid proxy, one can use
lcg-infosites
. An example is shown here:
bash-3.2$ lcg-infosites -v 2 --vo vo.gear.cern.ch ce
RAMMemory Operating System System Version Processor Subcluster name
-------------------------------------------------------------------------------------------------------------------------
2000 ScientificCERNSLC Boron Xeon ce202.cern.ch
2000 ScientificCERNSLC Boron Xeon ce203.cern.ch
2048 ScientificSL Beryllium Xeon ce.cyf-kr.edu.pl
32768 ScientificSL Boron Xeon ce02.lcg.cscs.ch
32768 ScientificSL Boron Opteron ce01.lcg.cscs.ch
32768 ScientificSL Boron Xeon ce11.lcg.cscs.ch
2000 ScientificCERNSLC Beryllium Xeon ce106.cern.ch
2000 ScientificCERNSLC Boron Xeon ce129.cern.ch
2000 ScientificCERNSLC Boron Xeon ce133.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce125.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce111.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce103.cern.ch
2000 ScientificCERNSLC Boron Xeon ce128.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce113.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce124.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce112.cern.ch
3072 CentOS Final IA32 gazon.nikhef.nl
2000 ScientificCERNSLC Beryllium Xeon ce107.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce105.cern.ch
3072 CentOS Final IA32 gazon.nikhef.nl
2000 ScientificCERNSLC Beryllium Xeon ce127.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce114.cern.ch
2000 ScientificCERNSLC Boron Xeon ce130.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce104.cern.ch
2000 ScientificCERNSLC Boron Xeon ce131.cern.ch
2000 ScientificCERNSLC Beryllium Xeon ce126.cern.ch
2000 ScientificCERNSLC Boron Xeon ce132.cern.ch
3072 CentOS Final IA32 trekker.nikhef.nl
2048 ScientificSL Boron Opteron grid36.lal.in2p3.fr
2000 ScientificCERNSLC Boron Xeon grid10.lal.in2p3.fr
4054 ScientificSL Final Dual Core AMD Opteron(tm) Processor 265 ce201.cern.ch
16384 ScientificSL Final Xeon dgc-grid-40.brunel.ac.uk
4054 ScientificSL Beryllium Dual Core AMD Opteron(tm) Processor 265 dc2-grid-65.brunel.ac.uk
0 dgc-grid-44.brunel.ac.uk
Release "Beryllium" correspond to SLC4, "Boron" to SLC5.
--
MassimoLamanna - 20-Apr-2010