Submitting jobs to the grid using the DIRAC API

Tired of using Ganga to submit jobs to the Grid? Bulk submission taking its toll on your sanity? Why not try using the DIRAC API directly? This is ideal when you want to run a large number of your own MC production jobs. All you need is a little bit of Python.


First of all, for those new to this game, DIRAC is the LHCb workload/data management system. It stands for Distributed Infrastructure with Remote Agent Control. Basically DIRAC submits lots of pilot jobs to the Grid and you submit jobs to the central DIRAC task queue. When the pilot jobs are satisfied that things on the site are OK, they pull in your job from the central queue.
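
To make the pull model concrete, here is a toy sketch in plain Python. It contains no DIRAC code: the function name, the site names and the job names are all made up for illustration, and the real pilot checks are of course far more involved.

```python
from collections import deque


def run_pilots(task_queue, site_ok):
    """Toy model of pilots pulling jobs from a central queue.

    task_queue: deque of job names waiting in the central queue.
    site_ok:    dict mapping a (hypothetical) site name to True/False,
                i.e. whether the pilot's sanity checks passed there.
    Returns a dict mapping each healthy site to the job its pilot pulled.
    """
    assignments = {}
    for site, healthy in site_ok.items():
        if not healthy:
            # A pilot on a broken site never pulls a user job.
            continue
        if not task_queue:
            break  # nothing left in the central queue
        assignments[site] = task_queue.popleft()
    return assignments


queue = deque(["job-1", "job-2", "job-3"])
sites = {"site-A": True, "site-B": False, "site-C": True}
matched = run_pilots(queue, sites)
print(matched)
print(list(queue))
```

The point of the design is that your job only leaves the central queue once a pilot has found a working slot, so a broken site costs a pilot rather than one of your jobs.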

API stands for application programming interface. Basically this is what Ganga is using under the hood when it submits to the Grid. The API documentation can be found here.

Notes / presentations explaining how stuff works are also available here and here.

Setting up your environment

On (best to use SL3 for the moment) you need to source some scripts and generate your proxy.


Job Submission

Create a Python script that runs your generation/selection/analysis (or all three, which can be done using the step() methods). In this script you can use the API to set up the application that you want to run and the options that you want to pass to it. You simply run the script by doing this:


There is an example of a Gauss job below.

Job Monitoring

Once you have submitted your jobs to the Grid, you will want to know which ones are running and when they have completed. You can find this all out using the monitoring page. This is useful for keeping track of your job IDs.


Alternatively, you can use this little script (altering it as you see fit, e.g. to loop over a list of jobIDs):

from DIRAC.Client.Dirac import *
import sys
dirac = Dirac()
jobid = sys.argv[1]

Call it and run it like this:

python jobID
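
Looping over a list of jobIDs could be sketched as follows. This assumes the IDs are stored one per line, as in the jobIDs.txt file written by the example Gauss job further down. The helper names are hypothetical, and the status lookup is deliberately left as a stand-in so that the sketch runs without DIRAC installed; replace it with whatever status call your DIRAC version provides.

```python
def parse_job_ids(lines):
    """Return the job IDs found in an iterable of text lines.

    Blank lines are skipped; every other line is expected to hold one ID.
    """
    return [line.strip() for line in lines if line.strip()]


def for_each_job(job_ids, action):
    """Call action(job_id) for every ID and collect the results in a dict."""
    return dict((job_id, action(job_id)) for job_id in job_ids)


# In practice the lines would come from open('jobIDs.txt').
ids = parse_job_ids(["1001\n", "\n", "1002\n"])

# Stand-in for the real status lookup (an assumption, not a DIRAC call).
results = for_each_job(ids, lambda job_id: "unknown")
print(results)
```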


Although DIRAC states that your job was successful, this isn't necessarily the case! What DIRAC actually reports is whether the pilot job (i.e. the job that is sent to the site first of all before pulling your job from the central LHCb task queue) was successful or not. This is completely different from whether your application was successful or not. You only find out what happened once the job output has been retrieved (see next section). Alternatively, if you click on the jobID in the monitoring page, you sometimes get extra information about why the job failed (like application exit status).

Job Output Retrieval

Use this script to get the output of your job once it is complete:

from DIRAC.Client.Dirac import *
import sys
dirac = Dirac()
jobid = sys.argv[1]

Call it (or anything you like) and run it like this:

python jobID

If you are storing files on the grid, you can look them up in the LFC:

lfc-ls /grid/lhcb/user/g/gcowan/jobID

These files can be copied to your local space (assuming you have sufficient available space - lxplus users beware!). Note the lack of /grid/ at the start of the LFN below.

from DIRAC.Client.Dirac import *
dirac = Dirac()
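
The relation between the LFC path shown by lfc-ls and the LFN can be sketched with a small helper. The function name is made up for illustration, and the only rule it encodes is the one stated above: drop the leading /grid.

```python
def lfc_path_to_lfn(lfc_path, prefix="/grid"):
    """Strip the leading /grid component from an LFC path to get the LFN."""
    if lfc_path == prefix or lfc_path.startswith(prefix + "/"):
        return lfc_path[len(prefix):] or "/"
    # Already an LFN (or not under /grid): leave it untouched.
    return lfc_path


print(lfc_path_to_lfn("/grid/lhcb/user/g/gcowan/jobID"))
```

For the directory listed above this prints /lhcb/user/g/gcowan/jobID.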

Sending libraries with the job

If you write your own DaVinci algorithm, or need some other custom-built library, then you will need to specify this in the job options before submitting to DIRAC. This is done by creating a directory called lib and setting this up in the input sandbox. See the example Gauss job below for this. In addition, other libraries can be placed in this directory, which can be useful if a site is missing particular parts of the installation. Of course, you only actually find this out once a job has failed.
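
Filling the lib directory can be done by hand, but a small helper saves some typing when there are several libraries. This is only a sketch: the function name is made up, and it does nothing DIRAC-specific. It simply copies the files you give it into a directory called lib, which you then list in the input sandbox yourself.

```python
import os
import shutil


def stage_libraries(library_paths, lib_dir="lib"):
    """Copy shared libraries into lib_dir and return the staged file names."""
    if not os.path.isdir(lib_dir):
        os.makedirs(lib_dir)
    staged = []
    for path in library_paths:
        shutil.copy(path, lib_dir)
        staged.append(os.path.basename(path))
    return sorted(staged)


# Usage (the library path is a placeholder):
# stage_libraries(["path/to/libMyAlgorithm.so"])
```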

Example Gauss job

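from DIRAC.Client.Dirac import *

dirac = Dirac()
job = Job()

# Record every job ID so the jobs can be monitored and retrieved later.
f = open('jobIDs.txt', 'w')

# 131 run numbers: 1, 1001, ..., 130001 (one per job).
for i in range(1, 131000, 1000):
    # Per-job options; a distinct RunNumber gives each job a different seed.
    opts = """
    ApplicationMgr.EvtMax = 1000;
    GaussGen.RunNumber = %d;
    """ % i
    jobid = dirac.submit(job, verbose=0)
    print "Job ID = ", jobid
    f.write('%s\n' % jobid)

f.close()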

This can be used to submit 131 jobs, each of which generates 1000 events using Gauss. The options that are specified here override those in the Gauss.opts file. Notice that I change the RunNumber with each submission so that each job is started with a different seed. If you don't do this, all of the jobs will generate the same data! In this case, I have configured Gauss.opts to run in generator mode, as described here. Some things to note:

  • The script outputs the jobIDs to a file, which is a handy way of keeping track of your jobs. Obviously, Ganga does all this stuff for you, but I found it to be incredibly slow when submitting a large number of jobs. The Ganga CLI was also taking a long time to start up and there were inconsistencies between the job status reported by Ganga and by DIRAC monitoring.
  • DIRAC will return the files you ask for in the OutputSandbox, unless they are larger than 10 MB, in which case they will be automatically stored on the Grid.
  • The log file must be of the form ApplicationName_Version.log, otherwise it will not be returned.
  • There appears to be a bug in DIRAC where it is not possible to specify more than one file in setOutputData(). This is the method that allows you to explicitly state which files are to be stored on the Grid.
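
Two of the points above are easy to check in plain Python before you submit anything: that the loop really produces 131 distinct run numbers, and what the log file has to be called. The helper names and the version string vXrY are placeholders of mine; none of this calls DIRAC.

```python
OPTIONS_TEMPLATE = "ApplicationMgr.EvtMax = %d;\nGaussGen.RunNumber = %d;\n"


def build_options(run_number, n_events=1000):
    """Return the per-job option lines used in the example Gauss job."""
    return OPTIONS_TEMPLATE % (n_events, run_number)


def expected_log_name(application, version):
    """Log file name of the form ApplicationName_Version.log."""
    return "%s_%s.log" % (application, version)


run_numbers = list(range(1, 131000, 1000))
all_options = [build_options(r) for r in run_numbers]

print(len(run_numbers))                    # 131 jobs
print(len(set(all_options)))               # 131 distinct option blocks
print(run_numbers[-1])                     # 130001
print(expected_log_name("Gauss", "vXrY"))  # Gauss_vXrY.log
```

If the number of distinct option blocks is smaller than the number of jobs, some jobs would share a run number and generate identical data.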

Problems you are likely to experience on the Grid

Obviously, when running on the Grid, your jobs end up running on a wide variety of sites, each of which has a slightly different (or incorrect) configuration for the LHCb software. This manifests itself in your jobs failing in a wide variety of ways. Here are a few examples:

Missing libraries

The application log file contains lines like:

gcc323/bin/Gauss.exe: error while loading shared libraries: cannot open shared object file: No such file or directory

This is due to a dodgy installation of the experiment software, meaning that not everything is available. Very annoying. Each time I have found a site in this state, I have emailed the DIRAC developers to get the site blacklisted. You can find the status of the grid from an LHCb point of view by looking at Raja Nandakumar's page. Hopefully the results of the LCG SAM (site availability monitoring) tests will be integrated with DIRAC soon.

Wrong versions of OpenSSL etc

This is just a warning and can be ignored.

RuntimeWarning: Python C API version mismatch for module rand: This Python has API version 1012, module rand has version 1011.
  import OpenSSL.rand 

Job options misconfigured

The message below indicates that there is something wrong with the job options file created by DIRAC. I do not understand what caused this, as I simply used the setOptions method to configure the number of events, run number, and names of the output files. Leaving the output files with the default values seems to resolve the problem.

JobOptionsSvc       FATAL Job options errors.
JobOptionsSvc       FATAL gaudirun.opts(143,1) : ERROR #3 : Syntax error
ApplicationMgr      FATAL Error initializing JobOptionsSvc
ApplicationMgr      FATAL Application configuration failed
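
Once the output of many jobs has been retrieved, checking each log by eye gets tedious. A rough sketch of a scanner for the three problems listed in this section is given here. The patterns are taken from the log excerpts above, while the labels and function name are my own, and the list is certainly not exhaustive.

```python
import re

# Signatures copied from the log excerpts in this section.
PATTERNS = [
    ("missing library",
     re.compile(r"error while loading shared libraries")),
    ("job options error",
     re.compile(r"JobOptionsSvc\s+FATAL")),
    ("harmless warning",
     re.compile(r"RuntimeWarning: Python C API version mismatch")),
]


def classify_log(lines):
    """Return the sorted labels of all problems whose signature appears."""
    found = set()
    for line in lines:
        for label, pattern in PATTERNS:
            if pattern.search(line):
                found.add(label)
    return sorted(found)


sample = [
    "JobOptionsSvc       FATAL Job options errors.",
    "ApplicationMgr      FATAL Application configuration failed",
]
print(classify_log(sample))
```

In practice you would pass open(logfile) instead of the sample list and run it over the log of every retrieved job.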

-- GreigCowan - 23 Jul 2007

Topic revision: r3 - 2007-07-23 - GreigCowan