LHCb Production Operations Procedures

This page collects the procedures that Grid Experts and Grid Shifters should follow. More information about Production Operations, along with some useful links, can be found on the Production Operations page. The draft template for defining production procedures is available in the following formats: Template (.doc), Template (.rtf), Template (.txt). Each procedure should be posted with a link to its documentation and its intended audience, e.g. Grid Experts, Grid Shifters or both.

How to derive a production

Typically, when a new application version is released to replace the current one, it is very useful to derive a production.

How to rerun a production job starting from a log SE link

Individual jobs frequently fail with errors that should be investigated by the applications experts. The following guide on how to rerun a job can be circulated when an applications expert asks how to do this.

Getting production to 100%

A production often reaches 95% quite easily, while getting it to 100% is much harder. A list of cases can be found in this link. (Mostly for Grid Experts and the Production Manager, but Grid Shifters can still pick up useful information.)

Closing a production

Keeping old productions in the production system is costly: if they are still active, they generate undue load on various components of the Production System. For example, the BookkeepingWatchAgent also loops over these no longer useful productions, which stretches the time needed to create tasks for the productions that are actually active. This link provides a procedure to pick out and close productions that are no longer alive.

Pilots monitor

If jobs have not been submitted for a long time, first check whether pilots are being submitted, and then whether they are actually matched. Start with the "Pilot monitor" page in the portal to see if there are pilots running or submitted. Then, with the command

dirac-admin-get-job-pilots jobID

you can check whether pilots are submitted for your job queue. This prints the logs for the pilots in the queue. If you don't see a line with

'Status': 'Submitted'

then there might be a problem.

Also, the pilot monitor page gives access to the output of the "done" pilots, which can contain useful information on why the pilots are not being matched.
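
The 'Status': 'Submitted' check described above can be automated on the saved output of dirac-admin-get-job-pilots. The following is a minimal sketch in plain Python that only scans text; it assumes the output contains entries shaped like 'Status': 'Submitted', as quoted above. The function names and the sample string are hypothetical and are not part of DIRAC.

```python
import re
from typing import List


def pilot_statuses(output: str) -> List[str]:
    """Return every value that follows a 'Status' key in the command output."""
    return re.findall(r"'Status':\s*'([^']+)'", output)


def has_submitted_pilot(output: str) -> bool:
    """True if at least one pilot in the output is in status 'Submitted'."""
    return "Submitted" in pilot_statuses(output)


# Hypothetical sample, shaped like the line quoted in the text above.
sample = "{'pilot-ref-1': {'Status': 'Submitted'}, 'pilot-ref-2': {'Status': 'Done'}}"

if has_submitted_pilot(sample):
    print("At least one pilot is submitted for this job queue.")
else:
    print("No submitted pilot found: there might be a problem.")
```

In practice the sample string would be replaced by the captured output of the command for the job in question.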

Data Management

How to fix screwed up replication transformations

Use dirac-transformation-debug; instructions are available here.

Transfer PIT - CASTOR

The data transfer between the PIT and CASTOR for the RAW data is handled on the machine lbdirac.cern.ch by the user lhcbprod. The DIRAC installation is under /sw/dirac/data-taking. The transfer itself is managed by the agent /sw/dirac/data-taking/startup/DataManagement_transferAgent. This Python process should run MaxProcess processes, and each process can start a new process for each transfer (MaxProcess can be found in /sw/dirac/data-taking/etc/DataManagement_TransferAgent.cfg). If you see only a few processes, you can look at the log /sw/dirac/data-taking/DataManagement_TransferAgent/log/current. A typical behaviour can be seen here.

You can also look at this web page to spot a potential problem, which shows up as a decreasing transfer rate. During normal data taking, a falling rate usually means that one or several processes are stuck. You can find them with strace -f -p _PID_. As soon as you find a stuck process, kill it with kill -9 _PID_. If this has no effect, stop the agent cleanly with touch /sw/dirac/data-taking/control/DataManagement/TransferAgent/stop_agent. If that does not produce any effect either, you can finally try runsvctrl t /sw/dirac/data-taking/startup/DataManagement_TransferAgent. As a last resort, you will have to kill the agent by hand with kill -9 _PID_.

The same recipe can be applied to the RemovalAgent.
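
The escalation order of the recipe above can be written down as a dry-run helper. This is only a sketch: it builds the command strings and executes nothing. The function name and the example PIDs are hypothetical, and the RemovalAgent paths are assumed to follow the same layout as the TransferAgent paths quoted in the text.

```python
from typing import List

DIRAC_ROOT = "/sw/dirac/data-taking"  # installation path quoted in the text


def escalation_steps(stuck_pid: int, agent_pid: int,
                     agent: str = "TransferAgent") -> List[str]:
    """Return the recovery commands in the order they should be tried.

    Nothing is executed; the operator decides what to run and when to stop.
    """
    return [
        # strace attaches to a running process with -p
        f"strace -f -p {stuck_pid}",
        f"kill -9 {stuck_pid}",
        # ask the agent to stop cleanly
        f"touch {DIRAC_ROOT}/control/DataManagement/{agent}/stop_agent",
        # send TERM through the service supervisor
        f"runsvctrl t {DIRAC_ROOT}/startup/DataManagement_{agent}",
        # last resort: kill the agent process itself
        f"kill -9 {agent_pid}",
    ]


# Example PIDs are placeholders, not real processes.
for number, command in enumerate(escalation_steps(12345, 12000), start=1):
    print(f"{number}. {command}")
```

Calling escalation_steps(..., agent="RemovalAgent") gives the corresponding list for the RemovalAgent under the path assumption stated above. Each step should be tried only if the previous one had no effect.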

Job Data Access Issues

The following document gives Grid Shifters a few hints on what to check when a job has problems accessing data at a site (Grid Shifters and Grid Experts). Checklists to debug dCache and CASTOR issues are also available.

Staging request blocked

If some STAGEIN requests are blocked, you can follow the recipe (http://lblogbook.cern.ch/Operations/4647) to recover the situation.

Changing the Data manager

The identity of the LHCb Data Manager needs to be swapped more often than one might expect. This procedure describes the steps to carry out the operation smoothly.

File recovery

How to recover replicas that are lost even though SRM reports that they exist

A file can be physically lost while SRM (lcg-ls) still reports that it is there; see this GGUS. The replica below is completely lost from both tape and disk:

 > lcg-ls -l srm://gridka-dCache.fzk.de/pnfs/gridka.de/lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw
-rw-r--r--   1     2     2 3145768992             NEARLINE /pnfs/gridka.de/lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw
        * Checksum: 51e2fc3d (adler32)
        * Space tokens: 39930230

You then have to remove the lost replica and copy it over again from another site:

$ dirac-dms-remove-lfn-replica /lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw GRIDKA-RAW
$ dirac-dms-replicate-lfn  /lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw GRIDKA-RAW

If there is only one replica and the corresponding file at the site has been lost completely, then you need to use dirac-dms-remove-files to remove the entry in the replica catalogue. You need to double check this is really the case, as this command will remove all replicas of the given file!
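
The choice between the two recovery paths above can be sketched as a small decision helper. It only returns command strings and runs nothing. The function name is hypothetical, the list of storage elements holding a replica must be supplied by the caller, and passing the LFN as the only argument to dirac-dms-remove-files is an assumption. CERN-RAW in the example is a placeholder for a second storage element.

```python
from typing import List


def recovery_commands(lfn: str, lost_se: str, replica_ses: List[str]) -> List[str]:
    """Return the commands for recovering a replica lost at lost_se.

    replica_ses lists every storage element registered as holding a replica,
    including the one where the file was lost.
    """
    if lost_se not in replica_ses:
        raise ValueError(f"{lost_se} is not registered as holding {lfn}")

    other_replicas = [se for se in replica_ses if se != lost_se]
    if other_replicas:
        # A good copy exists elsewhere: drop the lost replica, then re-replicate.
        return [
            f"dirac-dms-remove-lfn-replica {lfn} {lost_se}",
            f"dirac-dms-replicate-lfn {lfn} {lost_se}",
        ]
    # Only replica and it is gone: remove the file entry (removes ALL replicas).
    return [f"dirac-dms-remove-files {lfn}"]


lfn = "/lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw"
for command in recovery_commands(lfn, "GRIDKA-RAW", ["CERN-RAW", "GRIDKA-RAW"]):
    print(command)
```

The single-replica branch is destructive, so the replica list should be double checked before running the command it returns.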

Changing the Default Protocols List for a given Site (Tier-1)

The order of the list of protocols supplied to SRM can be changed e.g. testing root at NIKHEF means root is prepended to the list. This guide explains how to change the protocols list for a given site. This operation is restricted to those with the diracAdmin role. (Grid Experts)

Checking the throughput from the pit to Castor (during data taking)

The link bandwidth is 10 Gbit/s. The expected rate (beginning of 2012) is about 280 MB/s; some more details here.
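
To put the two numbers above on the same scale, the short calculation below converts the link bandwidth to MB/s and computes the share used by the expected rate. It assumes decimal units (1 Gbit = 1000 Mbit, 1 byte = 8 bits) and ignores protocol overhead.

```python
LINK_GBIT_PER_S = 10       # link bandwidth quoted in the text
EXPECTED_MB_PER_S = 280    # expected rate quoted in the text (beginning of 2012)

# 10 Gbit/s -> 10 000 Mbit/s -> 1250 MB/s
link_mb_per_s = LINK_GBIT_PER_S * 1000 / 8

# Fraction of the link capacity used by the expected rate
utilisation = EXPECTED_MB_PER_S / link_mb_per_s

print(f"Link capacity: {link_mb_per_s:.0f} MB/s")
print(f"Expected rate uses {utilisation:.1%} of the link")
```

Under these assumptions the expected rate is a little under a quarter of the link capacity, so a sustained rate well below 280 MB/s during data taking points to a problem on the transfer side rather than to a saturated link.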

Site Availability Monitoring (SAM) tests


SAM tests are used by LHCb to check that basic functionality of each Grid site is working. They run a few times each day at each site. The SAM framework has been updated in DIRAC3 and now runs as a specialized workflow with tailored modules for each functional test. Running the SAM tests is restricted to Grid Experts having the lcgadmin VOMS role.

Sites Troubleshooting

Investigating failed jobs

Site Mail contacts

The WLCG "conventional" site-support mailing lists are available here.

Daily Shifter Checklist

  • Each day the shifter should routinely check the following items to ensure the smooth running of distributed computing for LHCb.

End of production Checklist

Miscellaneous

Feature Requests and Bug Reports

Bookkeeping System

Set the data quality flag

  • To set the data quality flag, use dirac-bookkeeping-setdataquality-run or dirac-bookkeeping-setdataquality-files. Run without input parameters, each command shows the available quality flags and how to use it.
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[406]>dirac-bookkeeping-setdataquality-run
Available data quality flags:
UNCHECKED
OK
BAD
MAYBE
Usage: dirac-bookkeeping-setdataquality-run.py <RunNumber> <DataQualityFlag>

  • The following commands can be used to set the data quality flag:

dirac-bookkeeping-setdataquality-run

The input parameters are the run number and the data quality flag. To list the available data quality flags, run the command without input parameters, as in the example above.

The data quality flag is case sensitive. To set the data quality for a given run:

(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[408]>dirac-bookkeeping-setdataquality-run 20716 'BAD'
Quality flag has been updated!
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[409]>

dirac-bookkeeping-setdataquality-files

The input is either a logical file name (LFN) or a file containing a list of LFNs. To set the quality flag for one file:

(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[413]> dirac-bookkeeping-setdataquality-files /lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw 'BAD'
['/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw']
Quality flag updated!
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[414]>

To set the quality flag for a list of files:

(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[416]> dirac-bookkeeping-setdataquality-files lfns.txt 'BAD'
['/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43998/043998_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43995/043995_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43994/043994_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43993/043993_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43992/043992_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43989/043989_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43987/043987_0000000002.raw']
Quality flag updated!
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[417]>

The file lfns.txt contains the following:

/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43998/043998_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43995/043995_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43994/043994_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43993/043993_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43992/043992_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43989/043989_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43987/043987_0000000002.raw
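
Because the data quality flag is case sensitive, it can help to validate the flag and the LFN list before calling dirac-bookkeeping-setdataquality-files. The following is a minimal sketch in plain Python. The function names are hypothetical, and the requirement that every LFN starts with /lhcb/ is an assumption taken from the examples above, not a rule stated by the tool.

```python
from typing import Iterable, List

# Flags printed by the command when run without parameters
VALID_FLAGS = ("UNCHECKED", "OK", "BAD", "MAYBE")


def check_flag(flag: str) -> str:
    """Return the flag unchanged if it is valid; flags are case sensitive."""
    if flag not in VALID_FLAGS:
        raise ValueError(f"Unknown data quality flag {flag!r}; use one of {VALID_FLAGS}")
    return flag


def read_lfns(lines: Iterable[str]) -> List[str]:
    """Collect LFNs from an iterable of lines, skipping blank lines."""
    lfns = []
    for line in lines:
        lfn = line.strip()
        if not lfn:
            continue
        if not lfn.startswith("/lhcb/"):  # assumption based on the examples
            raise ValueError(f"Does not look like an LHCb LFN: {lfn!r}")
        lfns.append(lfn)
    return lfns


example_lines = [
    "/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw\n",
    "\n",
    "/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43998/043998_0000000002.raw\n",
]
print(check_flag("BAD"), len(read_lfns(example_lines)))
```

An open file object can be passed to read_lfns directly, for example with open("lfns.txt") as handle: read_lfns(handle).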

Documents

Topic attachments
  • DataAccessProblems.pdf (r4 r3 r2 r1, 1563.0 K, 2008-09-09 17:01, NickBrook): data access check list
  • LHCbSoftwareDeployment.pdf (r1, 198.4 K, 2008-08-04 14:43, StuartPaterson)
  • ProdOpsProcedureTemplate.doc (r1, 96.0 K, 2008-07-31 15:45, StuartPaterson): Template (.doc)
  • ProdOpsProcedureTemplate.rtf (r1, 0.7 K, 2008-08-18 00:17, StuartPaterson)
  • ProdOpsProcedureTemplate.txt (r1, 0.2 K, 2008-08-18 00:16, StuartPaterson)
  • RecoveryOfFilesLostBySE.doc (r1, 103.5 K, 2008-07-31 21:45, StuartPaterson)
  • SAMProcedure010808.pdf (r2 r1, 465.5 K, 2008-08-01 11:38, StuartPaterson)
  • Transfer_online_ps.tiff (r1, 510.6 K, 2012-04-17 17:07, JoelClosier): Transfer_online_nbprocesses
  • dirac-primary-states.pdf (r1, 32.3 K, 2008-09-15 12:44, GreigCowan): DIRAC primary job states
  • dirac-primary-states.png (r1, 109.0 K, 2008-09-15 12:45, GreigCowan): DIRAC primary job states
  • feature_request_and_bug_submission.pdf (r1, 195.0 K, 2008-09-05 13:23, PaulSzczypka): Procedure to submit feature requests and report bugs.
Topic revision: r91 - 2020-08-10 - FedericoStagni