Production Management

This page provides information on operational procedures for production managers.

Introduction

LHCb executes different types of jobs on its distributed computing facilities (aka the Grid).

  • Organized production activities. These are all data processing activities that are handled centrally by the "LHCb production team". Usually the data processed are large data-sets, e.g. all RAW data of a given year. The organized activities can be further sub-divided into
    • Data Processing Productions: All processing of "real data", i.e. data that was collected by the detector, or derivations of it (e.g. reconstructed data). Activities include the first pass processing of the detector RAW files, re-processing of previously collected RAW data, and re-/incremental-stripping of previously reconstructed data. Except for first pass processing, these activities are executed with high load in the shortest possible time. They are planned well in advance, and time scales are set up by the computing team in co-ordination with the physics teams. The processing usually concentrates on T0/1 sites but can also be extended to "attached Tier2s" if a very high load is expected.
    • Working Group Productions: An analysis or other activity that is too big for an individual user to organize and process, e.g. because the output data would not fit into their user space. The production team can help to set up these processings via a production activity. The planning of these activities is less stringent.
    • Monte Carlo Productions: Simulation productions run throughout the year, mainly on Tier2 sites but also on other Tier levels if computing capacities are available.
  • User jobs: These are analysis jobs submitted by the physicists to analyze data - usually generated by productions. User jobs have the highest priority of all activities. They usually require input data and will be executed where this input data is available (Tier0/1/2D sites).

NOTE: This twiki page concentrates mainly on Data Processing Productions but can also be applied for Working Group Productions. Parts of this twiki can also be applied to Monte Carlo Productions but those are slightly different.

Prerequisites to become a production manager

To work as a production manager the following rights and roles are needed:

Production Constituents

This section describes the different constituents of a production which are provided by different roles within the collaboration (see Fig 1).

Production Workflow
Figure 1: Production Workflow

Step

Responsible: convener of the activity (e.g. stripping convener), production managers may help

A production can execute several "Step"s. Each step describes the execution of a certain program which usually takes some input data, processes it, and produces another set of output data. Production Steps can be found in the Dirac Step Manager (see Fig 2).

Step Manager
Figure 2: Step Manager

To find a given step one can sub-select the steps in the left pane, e.g. by application type or state. Steps that are no longer used in any workflow and cannot be re-used in the future should be marked as "Obsolete".

  • Tip: In order to edit a step, left-click on the individual step and select "Edit". One needs to have selected the role "lhcb_tech".

A step is shown in Fig 3.

Step
Figure 3: Step

The fields of the step are:

Name Description
Name a meaningful name describing the step, it has no further implications on the workflows
ProcessingPass the name under which this step will write its files in the processing pass
Application & Version The LHCb application name and the version used in this step
SystemConfig the string describing the platform combination, i.e. which binary version of the application shall run
Option files options that will be passed to the application when invoked
Options format A hint to Dirac on which application type is being executed, select from the pull down menu
Multicore can this application be executed in multicore mode (for the time being no)
Extra packages additional packages needed in addition to the regular software stack dependencies. E.g. AppConfig drives configurations of applications, SQLDDDB is responsible for connecting to conditions data
Runtime project  
CondDB the conditions database tag
DDDB tag for the detector description database
DQTag ?
Visible Set to Y(es) if this step shall produce an entry in the Bookkeeping path, applies to almost all steps except e.g. Merging
Usable flag whether this step is usable in a request
Input/Output File Types The file types taken as input and produced as output by the step; the "Visible" column denotes whether a file type is also visible in the BK, i.e. can be selected by physicists as input

  • Tip: Check that the specific application and extra package versions are released on the requested platform and available on CVMFS (see also FAQ)
  • Tip: In order to use a step within a request (see next section) its "Usable" field needs to be set to "Yes"

Request

Responsible: Can be either done by the convener of the activity or the production manager

A request can consist of several steps and is the initial constituent from which a production is generated. Requests are handled within the LHCb Request Manager (see Fig 4).

Request Manager
Figure 4: Request Manager

As for the step manager, one can narrow down the requests by putting selection criteria in the left pane.

  • Tip: The "Show models only:" tick box only applies to Monte Carlo productions

Request Generation & Preparation

  • Tip: In order to generate a new request, first put selection criteria in the left pane which narrow down the list to a request similar to the one you want to create. Then (as "lhcb_prmgr") left-click an old request and select "Duplicate". When asked whether to keep the processing pass, this can be answered with yes; the individual steps can be changed afterwards.

To change a newly generated request, or an existing one in state "New", select it in the request manager and left-click "Edit". The result is a request modification page as shown in Fig 5.

Request
Figure 5: Request to edit

The fields of the request are

Name Description
Name a meaningful name of this request, it has no operational influence except that it will show up in the transformation monitor (see later)
Priority used for Monte Carlo productions, data processing productions are always priority "1a"
Inform also a list of users or email addresses to also inform about this request, can be e.g. the application convener(s)
Input Data input data for this request to be selected from the Bookkeeping (see below)
Processing Pass the individual steps of this request which constitute the input/outputs processed / generated
Comments free text for the request editors to put additional information, e.g. run ranges, special conditions, etc

Select Input Data

Pressing the "Select Input Data" button will launch a new popup window (see Fig 6) in order to select the data from the Bookkeeping for this request.

Request Select Input Data
Figure 6: Select Input Data

Once a final leaf of the tree is reached the "Select" button will become active. Some tips concerning the different constituents of the input data path:

  • Config for production data is in the sub-tree "LHCb". Another commonly used config is "validation" for any pre-production and verification activities.
  • version denotes the "activity" when taking this data, e.g. "Collision12" for pp collision data in 2012. Collision12_25 the same as above but with 25 ns bunch spacing.
  • Conditions usually look like "<BeamEnergy>-<VeloState>-<MagnetPolarity>", e.g. "Beam3500GeV-VeloClosed-MagDown". The beam energy in 2011 and 2012 was mostly of one kind, 3500 GeV and 4000 GeV respectively. The Velo state for data processing activities is always "VeloClosed". The magnet polarity often switches between "MagUp" and "MagDown".
  • Processing Pass denotes the processing of the data with the different versions of reconstruction, stripping etc programs and their tree within
  • DQ Flag, the data quality of the event data collected from the detector. The DQ team would flag data after some first processing when coming out of the detector.
  • File Type, the kind of files used for the processing, e.g. RAW for reconstruction, FULL.DST for data stripping, etc.
  • Production, ?
  • TCK, the trigger configuration used when taking the data, usually for data processing productions this will be set to "ALL"
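The constituents above are joined into one Bookkeeping path. As a hedged illustration (the helper function and argument names below are made up for this sketch; in practice the path is built by clicking through the tree in the popup window), the path used in the FAQ example further down can be assembled like this:

```python
# Sketch: how a Bookkeeping input data path is composed from its
# constituents. The function and argument names are illustrative only.

def bk_path(config, version, conditions, processing_pass, event_type, file_type):
    """Join the Bookkeeping constituents into one path string."""
    return "/" + "/".join([config, version, conditions, processing_pass, event_type, file_type])

path = bk_path(
    config="LHCb",                                # "validation" for pre-productions
    version="Collision11",                        # activity when taking the data
    conditions="Beam3500GeV-VeloClosed-MagDown",  # beam energy, Velo state, magnet polarity
    processing_pass="RealData/Reco14",            # processing tree of the data
    event_type="90000000",
    file_type="FULL.DST",                         # e.g. RAW for reconstruction
)
print(path)
# /LHCb/Collision11/Beam3500GeV-VeloClosed-MagDown/RealData/Reco14/90000000/FULL.DST
```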

Change Processing Pass

The individual steps of the processing pass of a request can be added / changed / deleted in the lower left part of the request. When working with steps a new popup window will appear (see Fig 7) that allows the selection of a step.

Request Change Processing Pass
Figure 7: Change Processing Pass

The individual steps of a request can be changed, added, deleted as needed. This will follow the workflows as produced in the step manager and described in the LHCb workflows twiki.

  • Tip: when selecting the first step of a request (also when changing it) tick the "Show also non-coinciding steps" box.

When selecting the second and later steps only those steps will be listed that match as input the output of the previous step.

Request Signing

Once a request has been configured correctly it needs to go through a signing process, which is

Request State Role needed Action needed New Request State
New lhcb_tech Submit to production team Submitted
Submitted lhcb_tech Sign tech Tech OK
Tech OK lhcb_ppg Sign ppg Accepted
Accepted lhcb_prmgr Activate Active
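The signing sequence can be sketched as a tiny state machine (illustrative only; the actual transitions are enforced by the Request Manager itself):

```python
# Sketch of the request signing sequence from the table above.
# Maps (current state, role) to (action, next state); purely illustrative,
# the real bookkeeping is done by the LHCb Request Manager.

TRANSITIONS = {
    ("New", "lhcb_tech"): ("Submit to production team", "Submitted"),
    ("Submitted", "lhcb_tech"): ("Sign tech", "Tech OK"),
    ("Tech OK", "lhcb_ppg"): ("Sign ppg", "Accepted"),
    ("Accepted", "lhcb_prmgr"): ("Activate", "Active"),
}

def advance(state, role):
    """Return the next request state, or raise if this role cannot act."""
    try:
        action, new_state = TRANSITIONS[(state, role)]
    except KeyError:
        raise ValueError(f"role {role!r} cannot act on state {state!r}")
    return new_state

state = "New"
for role in ("lhcb_tech", "lhcb_tech", "lhcb_ppg", "lhcb_prmgr"):
    state = advance(state, role)
print(state)  # Active
```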

For the different signatures the role needs to be changed accordingly. An example is shown in Fig 8: after changing the role (bottom) and left-clicking "Sign", the request can be signed as "lhcb_tech" in the upcoming window.

Signing Request as lhcb_tech
Figure 8: Sign lhcb_tech

Production Generation

Responsible: Production Manager

Once a request has reached the state "Active" a production can be launched from it. With role "lhcb_prmgr", "Edit" the request, then press "Generate". The first popup window allows the selection of the production "template" (see Fig 9). The types relevant for this twiki are:

  • RecoStripping_run.py for data processing activities such as reconstruction, stripping, re-stripping, incremental stripping, etc.
  • everyThingElse_run.py for any other production type, e.g. Working Group productions, DataSwimming, etc.

Selecting a production template
Figure 9: Production template selection

Reconstruction / Stripping Production

Description of reconstruction / stripping productions

Reconstruction / Stripping Production Template
Figure 10: Reco/Stripping Production Template

Name Description
WORKFLOW Only one of those workflows needs to be set to "True"
GENERAL: Set True for certification test Not used by production managers (certification of new Dirac versions)
GENERAL: Set True for local test Set "True" if a production shall be launched from the command line (see later)
GENERAL: Set True for validation prod Set "True" for pre-productions, the output of this production will go into config "validation" in the Bookkeeping, this can be useful e.g. for tests done before a big production campaign
GENERAL: Workflow destination site Usually left at "ALL" to let Dirac decide where to launch the jobs of this production. In special cases a certain site can be selected (use the "LHCb name", e.g. "LCG.CNAF.it")
GENERAL: Workflow string to append to the production name In case a production has already been launched from this request, this number needs to be increased by 1 in order to make the generated production name unique; the same applies in case a production produced errors and needs to be deleted and re-launched
GENERAL: discrete list of run numbers a comma delimited (?) list of runs this production shall run on
GENERAL: extra options as python dict a python dictionary which will add extra options to the different steps of this production, e.g. increase the verbosity level of the Brunel step1 as {1:"from Configurables import Brunel; Brunel().Monitors += [ \'NameAuditor\' ]"}
GENERAL: fraction to process per run in case only part of a run shall be processed, put the integer percentage number here (e.g. 50 for half of the files). Note this was used at the end of 2012 for first pass processing and DQ flagging; not sure this will be used again
GENERAL: minimum number of files to process per run Integer number of minimal files to process in case the above is set. Again used for 2012 first pass processing, probably not any more
GENERAL: previous prod ID Needed for a "derived production", see FAQs
GENERAL: run end, run start the start and end run numbers for this production. IMPORTANT: note the lines are "inverted", the start run line is the second one. IMPORTANT 2: when launching a production, only set the initial run range wanted for the first "bunch" of files to be processed; run ranges can be extended afterwards
PROD-1: DataReconstruction or DataReprocessing for the reco production, whether this is a first pass processing (DataReconstruction) or a re-processing (DataReprocessing)
PROD-: multicore flag Is this step able to run in multi-core queues? (usually not for the time being)
PROD-: group size or number of input files for merging productions this defines the GB of output to accumulate before starting a new job (usually 5); for other productions (reco/stripping) it is the number of input files for a single job (reco 1, stripping can be 2)
PROD-: Max CPU time in secs an estimate of how many CPU seconds an individual job would last. In the past used numbers are reco 1.4M, stripping 1M, merging 300K
PROD-: Output Data Storage Element the Dirac storage element where the output data should be uploaded to, see also workflows for details
PROD-: policy for input data access whether the input data for the job shall be "download"ed to the worker node (for data processing yes). The alternative is "protocol", i.e. remote access
PROD-: priority the priority of jobs for this step with respect to other data processing jobs (2-8). Usually towards the end of the workflow the priority should increase, e.g. in order to "eat up" merging files more quickly. The extreme cases 1 and 9 (run immediately) should be used with caution
PROD-: production plugin name how should Dirac handle the input files and assign them to jobs
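The interplay of the "fraction to process per run" and "minimum number of files to process per run" options can be sketched as follows (the exact rounding and capping done by Dirac is an assumption here):

```python
# Sketch: how "fraction to process per run" (integer percent) and
# "minimum number of files to process per run" could combine.
# The exact rounding/capping done by Dirac is an assumption.

def files_to_process(n_files_in_run, fraction_percent, min_files):
    """Files of a run to process, honouring the minimum but never
    more than the run actually contains."""
    wanted = n_files_in_run * fraction_percent // 100
    return min(n_files_in_run, max(min_files, wanted))

print(files_to_process(1000, 50, 10))  # 500: half of a large run
print(files_to_process(12, 50, 10))    # 10: the minimum kicks in
print(files_to_process(4, 50, 10))     # 4: capped at the run size
```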

  • Tip: spread the priorities of the productions as much as possible over the range 2 - 8 for a given workflow

Examples for production plugins:

  • ByRun: Only process files of a given run together
  • ByRunWithFlush: as above, but if there are not enough input files left for a complete last job, assign a last job also with fewer than the requested number of input files
  • ... need more ...
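The gist of the two plugins can be illustrated with a short sketch (the real plugins live in Dirac's transformation system; this only shows the grouping idea):

```python
# Sketch of the ByRun / ByRunWithFlush grouping idea. Illustrative only;
# the real plugins are part of Dirac's transformation system.
from collections import defaultdict

def group_by_run(files, files_per_job, flush=False):
    """Group (run, lfn) pairs into jobs of files_per_job files per run.

    Without flush an incomplete last group of a run is held back until
    enough files arrive; with flush (ByRunWithFlush) it becomes a job too.
    """
    per_run = defaultdict(list)
    for run, lfn in files:
        per_run[run].append(lfn)
    jobs = []
    for run in sorted(per_run):
        lfns = per_run[run]
        for i in range(0, len(lfns), files_per_job):
            chunk = lfns[i:i + files_per_job]
            if len(chunk) == files_per_job or flush:
                jobs.append((run, chunk))
    return jobs

files = [(1, "a"), (1, "b"), (1, "c"), (2, "d")]
print(group_by_run(files, files_per_job=2))              # [(1, ['a', 'b'])]
print(group_by_run(files, files_per_job=2, flush=True))  # adds (1, ['c']) and (2, ['d'])
```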

"Everything Else" Productions

Description of production template for any other than data processing productions

Everything Else Production Type
Figure 11: "Everything Else" Production Template

Very similar to the previous template, the only difference is the freedom to describe the steps needed and map them to production types.

Name Description
List here which prods you want to create The production types as described in Dirac need to be listed here, see list of production types e.g. from the Job Monitor
Production steps the steps as described in the request and their order

The remaining options are the same as above

Launching the production

Once the template has been filled in, press the "Generate" button at the bottom.

IMPORTANT: generating a production can take some time (a minute or more). Refrain from pressing the "Generate" button twice, it will generate another production and produce a mess!

When the new production is generated it will appear in the "TransformationMonitor" (see Fig 12) and can be monitored from there.

Transformation Monitor
Figure 12: Transformation Monitor

Production Operation

Responsible: Production Manager together with the operations teams (Production Shifter, Grid Expert on Call (GEOC))

Once the production has been launched, its operation shall be monitored by the operations team (production shifters, GEOC). The operations team will mainly concentrate on problems produced by the production, such as crashing applications, problems with input data access, grid batch system problems, etc. In addition the production team needs to keep an overall eye on the production itself, especially on the flow of files through the system. This is best monitored with the WMS monitoring of the Dirac portal and the "shifter plots" for the processing activity launched.

During major production activities, e.g. an incremental stripping campaign, the production manager should also connect to the LHCb Computing Operations Meeting, which is usually announced on lhcb-grid@cern.ch and takes place Mon, Wed, Fri 11:30-12:00.

Operations that can be taken on a production

In the TransformationMonitor (see Fig 12) a single production or a set of productions can be selected and actions can be taken on them (top right in the window). These are:

Name Description
Start start a previously stopped production
Stop no more tasks will be generated
Flush remaining files will be attached to jobs and executed
Complete a production has finished its processing and can be closed successfully, the output data of the jobs remains on storage, the information about the individual jobs will be removed from Dirac
Clean delete all information that this production has produced, including jobs and output data, e.g. for a validation production that is not needed anymore or a production that failed

  • Tip: In case a real production (config=LHCb) failed and needs to be cleaned, it is advised to run the next production with a different processing pass, as the removal of files can take a long time and may not be 100 % successful.

Ending a production

Responsible: Production Manager

For ending a production the tool "dirac-production-check-descendants" is provided, which checks that all files have been processed, that no double processing has happened and that all information is uploaded to the file catalog and bookkeeping, e.g.

[localhost, Patch] ~ $ dirac-production-check-descendants 32860,33004
Processing DataReprocessing production 32860 
Looking for descendants of type ['FULL.DST'] 
Getting files from the TransformationSystem... 
Found 32025 processed files and 0 non processed files (16.6 seconds) 
Now getting daughters for 32025 processed mothers in production 32860 (chunks of 500) ................................................................. (7456.0 
seconds)
Checking replicas for 31330 files (chunks of 1000)................................ (410.5 seconds)

Results: 
31330 unique daughters found with real descendants 
No processed LFNs with multiple descendants found -> OK! 
No processed LFNs without descendants found -> OK! 
No non processed LFNs with multiple descendants found -> OK! 
No non processed LFNs with descendants found -> OK! 
Processed production 32860 in 7905.3 seconds 
Processing DataReprocessing production 33004 
Looking for descendants of type ['FULL.DST'] 
Getting files from the TransformationSystem... 
Found 7466 processed files and 0 non processed files (4.0 seconds) 
Now getting daughters for 7466 processed mothers in production 33004 (chunks of 500) ............... (1557.2 seconds)
Checking replicas for 7277 files (chunks of 1000)........ (86.3 seconds)

Results: 
7277 unique daughters found with real descendants 
No processed LFNs with multiple descendants found -> OK! 
No processed LFNs without descendants found -> OK! 
No non processed LFNs with multiple descendants found -> OK! 
No non processed LFNs with descendants found -> OK! 
Processed production 33004 in 1655.1 seconds

FAQs

How to check the availability of the software on CVMFS?

  • Logon to lxplus.cern.ch or any other node that runs a CVMFS client
  • cd to /cvmfs/lhcb.cern.ch (needs to automount)
  • look for the release in the CVMFS tree; if it can be found there it's likely to be deployed on the grid as well, e.g.

[pcites03] /afs/cern.ch/user/r/roiser > cd /cvmfs/lhcb.cern.ch
[pcites03] /cvmfs/lhcb.cern.ch > ls lib/lhcb/DAVINCI/DAVINCI_v33r9/InstallArea/
x86_64-slc5-gcc46-dbg  x86_64-slc5-gcc46-opt  x86_64-slc6-gcc46-dbg  x86_64-slc6-gcc46-opt  x86_64-slc6-gcc48-dbg  x86_64-slc6-gcc48-opt
[pcites03] /cvmfs/lhcb.cern.ch > 

What about run ranges of data processing productions?

A few tips concerning the run ranges for productions which contain large sets of input data:

  • Start with a small amount of input files initially (<5k). Usually a slow start doesn't harm and is less disruptive if anything goes wrong.
  • Whenever the number of files in "Staging" at any of the storages goes down, even before reaching 0, the run range can be extended. Usually it takes O(hours) for the staging systems to react.
  • Run ranges can be extended in steps of the order of 10-20 k input files; this has proven not to overload any parts of the system.

How to calculate the number of input files / pb-1 for a production?

The number of files and pb-1 of a given run range can be calculated on the command line with "dirac-bookkeeping-get-stats"

[pcites03] ~ > dirac-bookkeeping-get-stats --BKQuery=/LHCb/Collision11/Beam3500GeV-VeloClosed-MagDown/RealData/Reco14/90000000/FULL.DST --Runs=87219:87900 --DQFlags=OK
For BK query: {'ConfigVersion': 'Collision11', 'ConfigName': 'LHCb', 'ConditionDescription': 'Beam3500GeV-VeloClosed-MagDown', 'EndRun': 87900, 'EventType': '90000000', 'FileType': 'FULL.DST', 'ProcessingPass': '/Real Data/Reco14', 'Visible': 'Yes', 'DataQuality': ['OK'], 'StartRun': 87219, 'DataTakingConditions': 'Beam3500GeV-VeloClosed-MagDown'}
Getting info from files...
Nb of Files      : 2'135
Nb of Events     : 117'153'033
Total size       : 8.861 TB (75.6 kB per evt)
Luminosity       : 5.252 /pb
Size  per /pb    : 1687.2 GB
Files per /pb    : 406.5

[pcites03] ~ > 
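The two "per /pb" lines of the output are simple ratios of the numbers above them, e.g. (assuming metric units, 1 TB = 1000 GB):

```python
# Reproduce the "per /pb" numbers from the dirac-bookkeeping-get-stats
# output above (assuming metric units, 1 TB = 1000 GB).

n_files = 2135
total_size_tb = 8.861
luminosity_pb = 5.252

size_per_pb_gb = total_size_tb * 1000 / luminosity_pb
files_per_pb = n_files / luminosity_pb

print(f"Size  per /pb    : {size_per_pb_gb:.1f} GB")  # 1687.2 GB
print(f"Files per /pb    : {files_per_pb:.1f}")       # 406.5
```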

What is a derived production and how can it be launched?

-- StefanRoiser - 18 Feb 2014

Topic revision: r5 - 2018-09-23 - MarcoCattaneo