LHCb executes different types of jobs on its distributed computing facilities (aka the Grid).
Organized production activities. These are all data processing activities that are handled centrally by the "LHCb production team". Usually the data processed are large data-sets, e.g. all RAW data of a given year. The organized activities can be further sub-divided into:
Data Processing Productions: All processing of "real data", i.e. data that was collected by the detector or derivations of it (e.g. reconstructed data). Activities include the first pass processing of the detector RAW files, re-processing of previously collected RAW data, and re-/incremental-stripping of previously reconstructed data. Except for first pass processing, these activities are executed with high load in the shortest possible time. Such activities are planned well in advance, with time scales set up by the computing team in co-ordination with the physics teams. The processing usually concentrates on T0/1 sites but can also be extended to "attached Tier2s" if a very high load is expected.
Working Group Productions: An analysis or other activity that is too big for an individual user to organize and process, e.g. because the output data would not fit into their user space. The production team can help set up these processings via a production activity. The planning of these activities is less stringent.
Monte Carlo Productions: Simulation productions are executed throughout the year, mainly on Tier2 sites but also on other Tier levels if computing capacity is available.
User jobs: These are analysis jobs submitted by physicists to analyze data, usually data generated by productions. User jobs have the highest priority of all activities. They usually require input data and will be executed where this input data is available (Tier0/1/2D sites).
NOTE: This twiki page concentrates mainly on Data Processing Productions but also applies to Working Group Productions. Parts of it can also be applied to Monte Carlo Productions, which are however slightly different.
Prerequisites to become a production manager
To work as a production manager the following rights and roles are needed:
You need special LHCb DIRAC roles, at least "lhcb_prmgr" and possibly in addition "lhcb_tech" and "lhcb_ppg"; please contact the LHCb Dirac admins (Stefan, Joel, Federico, ...) to obtain these.
Subscribe to the CERN e-groups "lhcb-production-manager" and "lhcb-grid" at https://e-groups.cern.ch
Production Constituents
This section describes the different constituents of a production which are provided by different roles within the collaboration (see Fig 1).
Figure 1: Production Workflow
Step
Responsible: convener of the activity (e.g. stripping convener), production managers may help
A production can execute several "Step"s. Each step describes the execution of a certain program which usually takes some input data, processes it and produces another set of output data. Production Steps can be found in the Dirac Step Manager (see Fig 2).
Figure 2: Step Manager
To find a given step one can sub-select the steps in the left pane, e.g. by application type or state. Steps which are no longer used in any workflow and possibly cannot be re-used in the future shall be marked as "Obsolete".
Tip: In order to edit a step, left-click the individual step and select "Edit". One needs to select the role "lhcb_tech".
A step is shown in Fig 3.
Figure 3: Step
The fields of the step are:
Name
a meaningful name describing the step; it has no further implications for the workflows
ProcessingPass
the entry under which this step will register its output in the processing pass
Application & Version
The LHCb application name and the version used in this step
SystemConfig
the string describing the platform combination, i.e. which binary version of the application shall be run, e.g. "x86_64-slc6-gcc48-opt"
Option files
options that will be passed to the application when invoked
Options format
A hint to Dirac about which application type is being executed; select from the pull-down menu
Multicore
can this application be executed in multicore mode (for the time being no)
Extra packages
additional packages needed on top of the regular software stack dependencies, e.g. AppConfig drives the configuration of applications, SQLDDDB is responsible for connecting to conditions data
Runtime project
CondDB
the conditions database tag
DDDB
tag for the detector description database
DQTag
?
Visible
Set to Y(es) if this step shall produce an entry in the Bookkeeping path; applies to almost all steps except e.g. Merging
Usable
tag whether this step is usable in a request
Input/Output File Types
The file types taken as input and produced as output by the step; the "Visible" column denotes whether this file type is also visible in the BK, i.e. can be selected by physicists as input
Tip: Check that the specific application and extra package version are released on the requested platform and available on CVMFS (see also FAQ)
Tip: In order to use steps within a request (see next section) they need to be set to "Y"es in the "Usable" field
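As an illustration only, the fields above can be pictured as a plain record. The following Python sketch uses hypothetical values (the field names follow the list above; the option file path, package version and tags are made up):

# Illustrative sketch of a step as a plain Python dict; all values are
# hypothetical examples, not a real step from the Step Manager.
step = {
    "Name": "Stripping20-Merging",             # meaningful name, no workflow implications
    "ProcessingPass": "Stripping20",           # where the output appears in the processing pass
    "Application": "DaVinci", "Version": "v32r2p1",
    "SystemConfig": "x86_64-slc6-gcc48-opt",   # platform string the binary runs on
    "OptionFiles": "$APPCONFIGOPTS/some-options.py",      # hypothetical path
    "ExtraPackages": "AppConfig.v3r160",                  # hypothetical version
    "CondDB": "cond-20120831", "DDDB": "dddb-20120831",   # hypothetical tags
    "Visible": "Y",                            # creates an entry in the Bookkeeping path
    "Usable": "Yes",                           # can be used in a request
    "InputFileTypes": [{"FileType": "DST", "Visible": "N"}],
    "OutputFileTypes": [{"FileType": "DST", "Visible": "Y"}],
}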
Request
Responsible: Can be either done by the convener of the activity or the production manager
A request can consist of several steps and is the initial constituent from which a production is generated. Requests are handled within the LHCb Request Manager (see Fig 4).
Figure 4: Request Manager
As for the step manager, one can narrow down the requests by putting selection criteria in the left pane.
Tip: The "Show models only:" tick box only applies to Monte Carlo productions.
Request Generation & Preparation
Tip: In order to generate a new request, first put selection criteria in the left pane which narrow down the list of requests to one similar to the one you want to create. Then (as "lhcb_prmgr") left-click an old request and select "Duplicate". When asked whether to keep the processing pass, this can be answered with yes; the individual steps can be changed afterwards.
Once a new request is generated, or an existing "New" one shall be changed, it needs to be selected in the request manager and "edited" via left-click. The result is a request modification page as shown in Fig 5.
Figure 5: Request to edit
The fields of the request are:
Name
a meaningful name for this request; it has no operational influence except that it will show up in the transformation monitor (see later)
Priority
used for Monte Carlo productions; data processing productions are always priority "1a"
Inform also
a list of users or email addresses to also inform about this request, can be e.g. the application convener(s)
Input Data
input data for this request to be selected from the Bookkeeping (see below)
Processing Pass
the individual steps of this request, which define the inputs/outputs processed and generated
Comments
free text for the request editors to put additional information, e.g. run ranges, special conditions, etc.
Select Input Data
Pressing the "Select Input Data" button will launch a new popup window (see Fig 6) in order to select the data from the Bookkeeping for this request.
Figure 6: Select Input Data
Once a final leaf of the tree is reached the "Select" button becomes active. Some tips concerning the different constituents of the input data path:
Config for production data is in the sub-tree "LHCb". Another commonly used config is "validation" for any pre-production and verification activities.
version denotes the "activity" when taking this data, e.g. "Collision12" for pp collision data in 2012, or "Collision12_25" for the same but with 25 ns bunch spacing.
Conditions usually look like "Beam<energy>-<Velo state>-<Magnet polarity>", e.g. "Beam4000GeV-VeloClosed-MagDown". The beam energy in 2011 and 2012 was mostly of one kind, 3500 GeV and 4000 GeV respectively. The Velo state for data processing activities is always "VeloClosed". The magnet polarity often switches between "MagUp" and "MagDown".
Processing Pass denotes the processing of the data with the different versions of the reconstruction, stripping, etc. programs, and the tree they form within the Bookkeeping.
DQ Flag, the data quality of the event data collected from the detector. The DQ team flags data after some first processing when it comes out of the detector.
File Type, the kind of files used for the processing, e.g. RAW for reconstruction, FULL.DST for data stripping, etc.
Production, ?
TCK, the trigger configuration used when taking the data, usually for data processing productions this will be set to "ALL"
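Taken together, these constituents form a Bookkeeping query path. The following sketch re-assembles the Collision11 example path used in the FAQ at the end of this page (the DQ flag and TCK are passed as additional filters, not as path elements):

# Assemble a Bookkeeping query path from the constituents described above.
# The values reproduce the 2011 example from the FAQ below; the event type
# (90000000) also appears as a path element.
constituents = {
    "Config": "LHCb",                                # "validation" for pre-productions
    "ConfigVersion": "Collision11",                  # activity when taking the data
    "Conditions": "Beam3500GeV-VeloClosed-MagDown",  # energy, Velo state, polarity
    "ProcessingPass": "RealData/Reco14",
    "EventType": "90000000",
    "FileType": "FULL.DST",
}
bk_path = "/{Config}/{ConfigVersion}/{Conditions}/{ProcessingPass}/{EventType}/{FileType}".format(**constituents)
print(bk_path)
# /LHCb/Collision11/Beam3500GeV-VeloClosed-MagDown/RealData/Reco14/90000000/FULL.DST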
Change Processing Pass
The individual steps of the processing pass of a request can be added / changed / deleted in the lower left part of the request. When working with steps a new popup window will appear (see Fig 7) that allows the selection of a step.
Figure 7: Change Processing Pass
The individual steps of a request can be changed, added, deleted as needed. This will follow the workflows as produced in the step manager and described in the LHCb workflows twiki.
Tip: when selecting the first step of a request (also when changing it), tick the "Show also non-coinciding steps" box.
When selecting the second and later steps only those steps will be listed that match as input the output of the previous step.
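In other words, each step's input file types must be among the output file types of the previous step. A minimal Python sketch of this matching logic, with a hypothetical step representation:

# Illustrative only: check that consecutive steps of a request chain,
# i.e. every input file type of a step is produced by the previous step.
def steps_chain(steps):
    for previous, current in zip(steps, steps[1:]):
        if not set(current["InputFileTypes"]) <= set(previous["OutputFileTypes"]):
            return False
    return True

# Example: a reconstruction step feeding a stripping step (cf. the file
# types mentioned in the "Select Input Data" section above).
reco = {"InputFileTypes": ["RAW"], "OutputFileTypes": ["FULL.DST"]}
stripping = {"InputFileTypes": ["FULL.DST"], "OutputFileTypes": ["DST"]}
assert steps_chain([reco, stripping])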
Request Signing
Once a request has been configured correctly it needs to go through a signing process, which is:
| Request State | Role needed | Action needed | New Request State |
| New | lhcb_tech | Submit to production team | Submitted |
| Submitted | lhcb_tech | Sign tech | Tech OK |
| Tech OK | lhcb_ppg | Sign ppg | Accepted |
| Accepted | lhcb_prmgr | Activate | Active |
For the different signatures the roles need to be changed accordingly. An example is shown in Fig 8: after changing the role (bottom) and left-clicking "Sign", the request can be signed as "lhcb_tech" in the upcoming window.
Figure 8: Sign lhcb_tech
Production Generation
Responsible: Production Manager
Once a request has reached the state "Active" a production can be launched from it. With role "lhcb_prmgr", "edit" the request, then press "Generate". The first popup window allows the selection of the production "template" (see Fig 9). The types relevant for this twiki are:
RecoStripping_run.py for data processing activities such as reconstruction, stripping, re-stripping, incremental stripping, etc.
everyThingElse_run.py for any other production type, e.g. Working Group productions, DataSwimming, etc.
Figure 9: Production template selection
Reconstruction / Stripping Production
Description of reconstruction / stripping productions
Figure 10: Reco/Stripping Production Template
Only one of those workflows needs to be set to "True"
GENERAL: Set True for certification test
Not used by production managers (certification of new Dirac versions)
GENERAL: Set True for local test
Set "True" if a production shall be launched from the command line (see later)
GENERAL: Set True for validation prod
Set "True" for pre-productions, the output of this production will go into config "validation" in the Bookkeeping, this can be useful e.g. for tests done before a big production campaign
GENERAL: Workflow destination site
Usually left at "ALL" to let Dirac decide where to launch the jobs of this production. In special cases a certain site can be selected (use the "LHCb name", e.g. "LCG.CNAF.it").
GENERAL: Workflow string to append to the production name
In case a production has already been launched from this request, this number needs to be increased by 1 in order to make the generated production name unique; the same applies when a production produced errors and needs to be deleted and re-launched.
GENERAL: discrete list of run numbers
a comma delimited (?) list of runs this production shall run on
GENERAL: extra options as python dict
a python dictionary which adds extra options to the different steps of this production, e.g. increase the verbosity of the Brunel step 1 with {1: "from Configurables import Brunel; Brunel().Monitors += ['NameAuditor']"} (see the sketch after this list)
GENERAL: fraction to process per run
in case only part of a run shall be processed, put the integer percentage number here (e.g. 50 for half of the files). Note this was used at the end of 2012 for first pass processing and DQ flagging; it is not certain this will be used again.
GENERAL: minimum number of files to process per run
The minimum number of files to process, as an integer, in case the above is set. Again used for the 2012 first pass processing, probably not any more.
GENERAL: previous prod ID
Needed for a "derived production", see FAQs
GENERAL: run end, run start
the start and end run numbers for this production. IMPORTANT: note the lines are "inverted", the start run line is the second one. IMPORTANT2: when launching a production, only set the initial run range wanted for the first "bunch" of files to be processed; run ranges can be extended afterwards.
for the reco production: is this a first pass processing or a re-processing
PROD-: multicore flag
Is this step able to run in multi-core queues? (usually not for the time being)
PROD-: group size or number of input files
for merging productions this defines the GB of output to accumulate before starting a new job (usually 5); for other productions (reco/stripping) this is the number of input files for a single job (reco 1, stripping can be 2)
PROD-: Max CPU time in secs
an estimate of how many CPU seconds an individual job will last. Numbers used in the past are: reco 1.4M, stripping 1M, merging 300K
PROD-: Output Data Storage Element
the Dirac storage element where the output data should be uploaded to, see also workflows for details
PROD-: policy for input data access
shall the input data for the job be "download"ed to the worker node (for data processing: yes). The alternative is "protocol", i.e. remote access
PROD-: priority
the priority of the jobs of this step with respect to other data processing jobs (2-8). Usually towards the end of the workflow the priority should increase, e.g. in order to "eat up" merging files more quickly. The extreme values 1 and 9 (run immediately) should be used with caution
PROD-: production plugin name
how should Dirac handle the input files and assign them to jobs
Tip: spread the priority of the productions as much as possible over range 2 - 8 for a given workflow
ByRun: group the input files by run number and assign jobs with the requested number of input files per run
ByRunWithFlush: as above, but if there are not enough input files left for the last job, assign a last job also with less than the requested number of input files
... need more ...
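As referenced in the "extra options as python dict" field above, the dictionary is keyed by the step number within the production, and each value is a string of Gaudi/Python option code appended to that step. A minimal sketch, re-using the NameAuditor example from above:

# Extra options keyed by step number; step 1 (Brunel) gets an extra
# NameAuditor monitor. The value string is ordinary Gaudi configuration
# code that is appended to the step's options.
extra_options = {
    1: "from Configurables import Brunel; Brunel().Monitors += ['NameAuditor']",
}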
"Everything Else" Productions
Description of production template for any other than data processing productions
Figure 11: "Everything Else" Production Template
Very similar to the previous template; the only difference is the freedom to describe the steps needed and map them to production types.
Production types
the production types as described in Dirac need to be listed here, see the list of production types e.g. in the Job Monitor
Production steps
the steps as described in the request and their order
The remaining options are the same as above
Launching the production
Once the template has been filled in press the "Generate" button at the bottom
IMPORTANT: generating a production can take some time (a minute or more). Refrain from pressing the "Generate" button twice; it will generate another production and produce a mess!
When the new production is generated it will appear in the 'TransformationMonitor' (see Fig 12) and can be monitored from there.
Figure 12: Transformation Monitor
Production Operation
Responsible: Production Manager together with the operations teams (Production Shifter, Grid Expert on Call (GEOC))
Once the production has been launched, its operation shall be monitored by the operations team (production shifters, GEOC). The operations team will mainly concentrate on problems produced by the production, such as crashing applications, problems with input data access, grid batch system problems, etc. In addition the production team needs to keep an overall eye on the production itself, and especially on the flow of files through the system. These are best monitored with the WMS monitoring of the Dirac portal and the "shifter plots" for the processing activity launched.
During major production activities, e.g. an incremental stripping campaign, the production manager should also connect to the LHCb Computing Operations Meeting, which is usually announced on lhcb-grid@cern.ch and takes place Mon, Wed, Fri 11:30-12:00.
Operations that can be taken on a production
In the TransformationMonitor (see Fig 12) a single production or a set of productions can be selected and actions can be taken on them (top right in the window); these are
Flush
remaining files will be attached to jobs and executed
Complete
the production has finished its processing and can be closed successfully; the output data of the jobs remains on storage, while the information about the individual jobs will be removed from Dirac
Clean
delete all information that this production has produced, including jobs and output data, e.g. for a validation production that is not needed anymore or a production that failed
Tip: In case a real production (config=LHCb) failed and needs to be cleaned, it is advised to run the next production with a different processing pass, as the removal of files can take a long time and may not be 100 % successful.
Ending a production
Responsible: Production Manager
For ending a production the tool "dirac-production-check-descendants" is provided, which checks that all files have been processed, that no double processing has happened, and that all information is uploaded to the file catalog and bookkeeping, e.g.
[localhost, Patch] ~ $ dirac-production-check-descendants 32860,33004
Processing DataReprocessing production 32860
Looking for descendants of type ['FULL.DST']
Getting files from the TransformationSystem...
Found 32025 processed files and 0 non processed files (16.6 seconds)
Now getting daughters for 32025 processed mothers in production 32860 (chunks of 500) ................................................................. (7456.0 seconds)
Checking replicas for 31330 files (chunks of 1000)................................ (410.5 seconds)
Results:
31330 unique daughters found with real descendants
No processed LFNs with multiple descendants found -> OK!
No processed LFNs without descendants found -> OK!
No non processed LFNs with multiple descendants found -> OK!
No non processed LFNs with descendants found -> OK!
Processed production 32860 in 7905.3 seconds
Processing DataReprocessing production 33004
Looking for descendants of type ['FULL.DST']
Getting files from the TransformationSystem...
Found 7466 processed files and 0 non processed files (4.0 seconds)
Now getting daughters for 7466 processed mothers in production 33004 (chunks of 500) ............... (1557.2 seconds)
Checking replicas for 7277 files (chunks of 1000)........ (86.3 seconds)
Results:
7277 unique daughters found with real descendants
No processed LFNs with multiple descendants found -> OK!
No processed LFNs without descendants found -> OK!
No non processed LFNs with multiple descendants found -> OK!
No non processed LFNs with descendants found -> OK!
Processed production 33004 in 1655.1 seconds
FAQs
How to check the availability of the software on CVMFS?
Logon to lxplus.cern.ch or any other node that runs a CVMFS client
cd to /cvmfs/lhcb.cern.ch (needs to automount)
look for the release in the CVMFS tree; if it can be found there, it is likely to be deployed on the grid as well, e.g.
[pcites03] /afs/cern.ch/user/r/roiser > cd /cvmfs/lhcb.cern.ch
[pcites03] /cvmfs/lhcb.cern.ch > ls lib/lhcb/DAVINCI/DAVINCI_v33r9/InstallArea/
x86_64-slc5-gcc46-dbg x86_64-slc5-gcc46-opt x86_64-slc6-gcc46-dbg x86_64-slc6-gcc46-opt x86_64-slc6-gcc48-dbg x86_64-slc6-gcc48-opt
[pcites03] /cvmfs/lhcb.cern.ch >
What about run ranges of data processing productions?
A few tips concerning the run ranges for productions which contain large sets of input data:
Start with a small number of input files initially, <5k. Usually a slow start doesn't harm and is less disruptive if anything goes wrong.
Whenever any of the storages goes down in "Staging", even before reaching 0, the run range can be extended. Usually it takes O(hours) for the staging systems to react.
Extensions of the run range can be done in the order of 10-20k input files at a time; this has proven not to overload any parts of the system.
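As a back-of-the-envelope illustration of the previous point, one can pick the new end run such that the extension adds roughly 10-20k files. The run numbers and per-run file counts below are made-up placeholders:

# Illustrative only: choose a new end run so that extending the run range
# adds roughly 15k input files. Runs and file counts are hypothetical.
files_per_run = {run: 1000 for run in range(87219, 87260)}  # placeholder data
target = 15000

added = 0
for run in sorted(files_per_run):
    added += files_per_run[run]
    new_end_run = run
    if added >= target:
        break
print(new_end_run, added)  # 87233 15000 with the placeholder data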
How to calculate the number of input files / pb-1 for a production?
The number of files and pb-1 of a given run range can be calculated on the command line with "dirac-bookkeeping-get-stats"
[pcites03] ~ > dirac-bookkeeping-get-stats --BKQuery=/LHCb/Collision11/Beam3500GeV-VeloClosed-MagDown/RealData/Reco14/90000000/FULL.DST --Runs=87219:87900 --DQFlags=OK
For BK query: {'ConfigVersion': 'Collision11', 'ConfigName': 'LHCb', 'ConditionDescription': 'Beam3500GeV-VeloClosed-MagDown', 'EndRun': 87900, 'EventType': '90000000', 'FileType': 'FULL.DST', 'ProcessingPass': '/Real Data/Reco14', 'Visible': 'Yes', 'DataQuality': ['OK'], 'StartRun': 87219, 'DataTakingConditions': 'Beam3500GeV-VeloClosed-MagDown'}
Getting info from files...
Nb of Files : 2'135
Nb of Events : 117'153'033
Total size : 8.861 TB (75.6 kB per evt)
Luminosity : 5.252 /pb
Size per /pb : 1687.2 GB
Files per /pb : 406.5
[pcites03] ~ >
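The derived numbers in this output are simple ratios of the totals above, as this quick check shows:

# Reproduce the derived quantities from the dirac-bookkeeping-get-stats
# output above: size per event and the per-/pb numbers are plain ratios.
n_files = 2135
n_events = 117153033
total_size_tb = 8.861
luminosity_pb = 5.252

print(f"Size per evt : {total_size_tb * 1e12 / n_events / 1e3:.1f} kB")  # ~75.6 kB
print(f"Size per /pb : {total_size_tb * 1e3 / luminosity_pb:.1f} GB")    # ~1687.2 GB
print(f"Files per /pb: {n_files / luminosity_pb:.1f}")                   # ~406.5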
What is a derived production and how can it be launched?