Submitting jobs to the grid using the DIRAC API
Tired of using Ganga to submit jobs to the Grid? Bulk submission taking its toll on your sanity? Why not try using the DIRAC API directly? This is ideal when you want to run a large number of your own MC production jobs. All you need is a little bit of python.
Documentation
First of all, for those new to this game, DIRAC is the LHCb workload/data management system. It stands for Distributed Infrastructure with Remote Agent Control. Basically, DIRAC submits lots of pilot jobs to the Grid and you submit your jobs to the central DIRAC task queue. When a pilot job is satisfied that things on the site are OK, it pulls in your job from the central queue. API stands for application programming interface; this is what Ganga uses under the hood when it submits to the Grid. The API documentation can be found here.
Notes / presentations explaining how stuff works are also available here and here.
Setting up your environment
On lxslc3.cern.ch (best to use SL3 for the moment) you need to source a couple of environment scripts and generate your Grid proxy:
GridEnv
DIRACEnv
voms-proxy-init
Job Submission
Create a python script that runs your generation/selection/analysis (or all three, which can be done using the step() methods). In this script you can use the API to set up the application that you want to run and the options that you want to pass to it. You then simply run the script like this:
python job-script.py
There is an example of a Gauss job below.
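If you just want the bare bones, a job script looks roughly like the sketch below. This is only a skeleton: the application name, version and file names are placeholders, and the API calls are the same ones used in the Gauss example further down.
from DIRAC.Client.Dirac import *

# Skeleton only: replace the application, version and file names with your own.
dirac = Dirac()
job = Job()
job.setApplication('MyApplication', 'v1r0')          # placeholder application/version
job.setInputSandbox(['path/to/MyOptions.opts'])      # options file (and any lib directory)
job.setOutputSandbox(['MyApplication_v1r0.log'])     # files to bring back with the job
jobid = dirac.submit(job, verbose=0)
print "Job ID = ", jobid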
Job Monitoring
Once you have submitted your jobs to the Grid, you will want to know which ones are running and when they have completed. You can find this all out using the monitoring page. This is useful for keeping track of your job IDs.
Monitoring page: http://lhcb.pic.es/DIRAC/Monitoring/Analysis/
Alternatively, you can use this little script (altering it where you see fit - i.e. looping over a list of jobIDs):
from DIRAC.Client.Dirac import *
import sys
dirac = Dirac()
jobid = sys.argv[1]
dirac.status(jobid)
Call it Monitor.py and run it like this:
python Monitor.py jobID
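If you saved your job IDs to a file at submission time (the Gauss example below writes them to jobIDs.txt), a small variation on the same script will loop over all of them. This is just a sketch; use whatever file name you chose.
from DIRAC.Client.Dirac import *

# Sketch: query the status of every job ID listed (one per line) in jobIDs.txt,
# the file written by the submission script further down this page.
dirac = Dirac()
for line in open('jobIDs.txt'):
    jobid = line.strip()
    if jobid:
        dirac.status(jobid)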
WARNING
Although DIRAC states that your job was successful, this isn't necessarily the case! What DIRAC actually reports is whether the pilot job (i.e. the job that is sent to the site first of all, before pulling your job from the central LHCb task queue) was successful or not. This is completely different from whether your application was successful or not. You only find out what happened once the job output has been retrieved (see the next section). Alternatively, if you click on the jobID in the monitoring page, you sometimes get extra information about why the job failed (such as the application exit status).
Job Output Retrieval
Use this script to get the output of your job once it is complete:
from DIRAC.Client.Dirac import *
import sys
dirac = Dirac()
jobid = sys.argv[1]
dirac.getOutput(jobid)
Call it Retrieve.py (or anything you like) and run it like this:
python Retrieve.py jobID
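Remember the warning above: a successful DIRAC status does not necessarily mean that the application itself ran cleanly, so once the output is back it is worth scanning the application log. Here is a minimal sketch; pass it the path to the retrieved log file, and note that the FATAL/ERROR strings are just the usual Gaudi markers (as in the failure examples at the bottom of this page), not anything DIRAC-specific.
import sys

# Sketch: print any obviously bad lines from a retrieved application log.
# Usage: python CheckLog.py path/to/Gauss_v25r10.log
for line in open(sys.argv[1]):
    if 'FATAL' in line or 'ERROR' in line:
        print line.rstrip()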
If you are storing files on the grid, you can look them up in the LFC:
lfc-ls /grid/lhcb/user/g/gcowan/jobID
These files can be copied to your local space (assuming you have sufficient available space - lxplus users beware!). Note the lack of /grid/ at the start of the LFN below.
from DIRAC.Client.Dirac import *
dirac = Dirac()
dirac.getOutputData('/lhcb/user/g/gcowan/105953/GaussMonitor-gc.root')
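If every job in a batch stores the same file on the Grid, you can combine this with the jobIDs.txt file from the submission script to fetch them all in one go. This is a sketch, assuming the LFN pattern from the example above; substitute your own Grid username and file name.
from DIRAC.Client.Dirac import *

# Sketch: fetch one stored output file for each job ID saved at submission time.
# The LFN pattern follows the example above; change the username and file name to yours.
dirac = Dirac()
for line in open('jobIDs.txt'):
    jobid = line.strip()
    if jobid:
        dirac.getOutputData('/lhcb/user/g/gcowan/%s/GaussMonitor-gc.root' % jobid)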
Sending libraries with the job
If you write your own DaVinci algorithm, or need some other custom-built library, then you will need to specify this in the job definition before submitting to DIRAC. This is done by creating a directory called lib and adding it to the input sandbox. See the example Gauss job below for this. In addition, other libraries can be placed in this directory, which can be useful if a site is missing particular parts of the software installation (of course, you only actually find this out once a job has failed).
Example Gauss job
from DIRAC.Client.Dirac import *

dirac = Dirac()
job = Job()
job.setApplication('Gauss','v25r10')
job.setInputSandbox(['path/to/Gauss.opts','path/to/lib'])

f = open('jobIDs.txt', 'w')
for i in range(1,131000,1000):
    opts = """
ApplicationMgr.EvtMax = 1000;
GaussGen.RunNumber = %d;
""" % i
    job.setOption(opts)
    job.setOutputSandbox(['GaussMonitor-gc.root','GaussHistos-gc.root','Gauss_v25r10.log'])
    jobid = dirac.submit(job,verbose=0)
    print "Job ID = ",jobid
    f.write('%s\n' % jobid)
f.close()
This can be used to submit 131 jobs, each of which generates 1000 events using Gauss. The options specified here override those in the Gauss.opts file. Notice that I change the RunNumber with each submission so that each job starts with a different seed. If you don't do this, all of the jobs will generate the same data! In this case, I have configured Gauss.opts to run in generator mode, as described here. Some things to note:
- The script outputs the jobIDs to a file, which is a handy way of keeping track of your jobs. Obviously, Ganga does all of this for you, but I found it to be incredibly slow when submitting a large number of jobs. The Ganga CLI also took a long time to start up, and there were inconsistencies between the job status reported by Ganga and by the DIRAC monitoring.
- DIRAC will return the files you ask for in the OutputSandbox, unless they are >10MB, in which case they will be automatically stored on the Grid.
- The log file must be of the form ApplicationName_Version.log, otherwise it will not be returned.
- There appears to be a bug in DIRAC where it is not possible to specify more than one file in setOutputData(). This is the method that allows you to explicitly state which files are to be stored on the Grid.
Problems you are likely to experience on the Grid
Obviously, when running on the Grid, your jobs end up running on a wide variety of sites, each of which has a slightly different (or incorrect) configuration of the LHCb software. This manifests itself in your jobs failing in a variety of ways. Here are a few examples:
Missing libraries
The application log file contains lines like:
/home/lhcb051/globus-tmp.egee013.14578.0/WMS_egee013_015125_https_3a_2f_2frb114.cern.ch_3a9000_2fpxlYbxEegDE1n3rv5iLTSw/lib/lhcb/GAUSS/GAUSS_v25r10/InstallArea/slc3_ia32_gcc323/bin/Gauss.exe: error while loading shared libraries: libGeneratorsLib.so: cannot open shared object file: No such file or directory
This is due to a dodgy installation of the experiment software, meaning that not everything is available. Very annoying. Each time I have found a site in this state, I have emailed the DIRAC developers to get the site blacklisted. You can find the status of the Grid from an LHCb point of view by looking at Raja Nandakumar's page. Hopefully the results of the LCG SAM (site availability monitoring) tests will be integrated with DIRAC soon.
Wrong versions of OpenSSL etc
This is just a warning and can be ignored.
/scr/u/WMS_wn019_017145_https_3a_2f_2frb114.cern.ch_3a9000_2fRVIxX0IJ4PwrWe2vKecULQ/DIRAC/python/DIRAC/Utility/DISET/OpenSSL/__init__.py:23: RuntimeWarning: Python C API version mismatch for module rand: This Python has API version 1012, module rand has version 1011.
import OpenSSL.rand
Job options misconfigured
Obviously this message indicates that there is something wrong with the job options file created by DIRAC. I do not understand what caused this, as I simply used the setOption() method to configure the number of events, the run number, and the names of the output files. Leaving the output files with their default values seems to resolve the problem.
JobOptionsSvc FATAL Job options errors.
JobOptionsSvc FATAL gaudirun.opts(143,1) : ERROR #3 : Syntax error
ApplicationMgr FATAL Error initializing JobOptionsSvc
ApplicationMgr FATAL Application configuration failed
--
GreigCowan - 23 Jul 2007