LHCb Production Operations Procedures
This is the LHCb Production Operations Procedures page, which contains procedures for Grid Experts and Grid Shifters to follow. More information about Production Operations and some useful links can be found at the
Production Operations page. The draft template for defining production procedures is available here in the following formats:
Template (.doc),
Template (.rtf),
Template (.txt). Each procedure should be posted with a link to the documentation and the intended audience, e.g. Grid Experts, Grid Shifters or both.
How to derive a production
Typically, when a new application version is released and should replace the current one, it is very useful to
derive a production.
How to rerun a production job starting from a log SE link
Individual jobs frequently fail with errors that should be investigated by applications experts. The following
guide on how to rerun a job can be circulated when questions come from an applications expert.
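In practice, rerunning usually amounts to fetching the job's option files and input catalogue from the log SE link and re-running the application locally. A minimal sketch, assuming the log SE link points to the job's log directory and that the main option file follows the usual naming (the URL, the application name/version and the option file name below are all placeholders):
wget -r -np logSE-URL/   # fetch the logs, option files and pool_xml_catalog.xml
SetupProject Brunel vXrY   # set up the same application environment as the job used
gaudirun.py prodID_jobID_1.py   # rerun the job with the retrieved options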
Getting production to 100%
Getting a production to 95% is usually easy; the hard part is reaching 100%. A list of cases can be found at
this link. (Mostly for Grid Experts and the Production Manager, but Grid Shifters can still pick up useful information.)
Closing a production
Keeping old productions in the production system is cumbersome: productions that are still active generate undue load on various components of the Production System, for example the
BookkeepingWatchAgent, which also loops over these no-longer-useful productions and stretches the time needed to create tasks for genuinely active productions. At
this link a procedure to identify and close no-longer-alive productions is provided.
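As a rough orientation before reading the full procedure, closing a production usually comes down to putting the corresponding transformation into a terminal status so that the agents stop polling it. A minimal sketch, assuming the standard DIRAC transformation script is deployed in the LHCbDIRAC installation (the script name and its exact effect are assumptions here; follow the linked procedure for the authoritative steps):
dirac-transformation-clean transID   # assumed: cleans the transformation and takes it out of the active loop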
Pilots monitor
If jobs are not being submitted for a long time, check first of all whether pilots are being submitted, and then whether they are actually matched. First, look at the "Pilot monitor" page in the portal to see whether there are pilots running or submitted. Then, with the command
dirac-admin-get-job-pilots jobID
you can check whether pilots have been submitted for your job's queue. This prints the logs for the pilots in the queue. If you don't see a line with
'Status': 'Submitted'
then there is probably a problem.
Also, through the Pilot monitor page you can see the pilot output for the "Done" pilots, which can contain useful information on why the pilots are not being matched.
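If you need more detail on a specific pilot from the command line, the following DIRAC admin commands can help (a minimal sketch; the pilot reference is the one printed by the command above, and access may be restricted to privileged roles):
dirac-admin-get-pilot-info pilotReference   # show status, destination site and owner of the pilot
dirac-admin-get-pilot-output pilotReference   # retrieve the pilot output for inspection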
Data Management
How to fix screwed up replication transformations
Use the
dirac-transformation-debug
script; instructions are available
here
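For reference, the basic invocation is typically of the form below (this usage is an assumption; see the instructions linked above for the full set of options):
dirac-transformation-debug transID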
Transfer PIT - CASTOR
The Data transfer betwen the PIT and CASTOR for the RAW is handle on the machine lbdirac.cern.ch by the user lhcbprod. The dirac installation is done under /sw/dirac/data-taking. The transfer itself is managed by the Agent /sw/dirac/data-taking/startup/DataManagement_transferAgent. This python process should run
MaxProcess processes and each process can start a new process for each transfer (
MaxProcess can be found in /sw/dirac/data-taking/etc/DataManagement_TransferAgent.cfg). If you don't see too many processes, you can look at the log /sw/dirac/data-taking/DataManagement_TransferAgent/log/current. A typical behaviour can be seen
here.
You can also look at this
web page
to spot a potential problem if you see the rate decrease. Under normal data-taking conditions this usually means that one or several processes are stuck. You can find them with
strace -f -p _PID_. As soon as you have found a stuck process you can kill it with
kill -9 _PID_. If this has no effect, you can stop the agent in the proper way with
touch /sw/dirac/data-taking/control/DataManagement/TransferAgent/stop_agent. If that does not have any effect either, you can try
runsvctrl t /sw/dirac/data-taking/startup/DataManagement_TransferAgent. As a last resort, you will have to kill it by hand:
kill -9 _PID_
The same recipe can be applied to the
RemovalAgent.
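To get an overview of the agent's processes before attaching strace, something like the following can help (a minimal sketch using standard Linux tools; the process name matches the startup path above):
ps -ef | grep DataManagement_TransferAgent | grep -v grep   # list the agent and its transfer processes
strace -f -p _PID_   # a process that prints no system calls for a long time is likely stuck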
Job Data Access Issues
The following
document is meant to give Grid Shifters a few hints on things they can check when a job has problems accessing data at a site. (Grid Shifters and Grid Experts)
Checklists to debug
dCache
and
CASTOR
issues are also available.
Staging request blocked
If some STAGEIN requests are blocked, you can follow the recipe (
http://lblogbook.cern.ch/Operations/4647
) to recover the situation.
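To check the stager status of a file at CERN by hand, the CASTOR client tools can be used (a minimal sketch, assuming the CASTOR clients are installed; the path below is a placeholder):
stager_qry -M /castor/cern.ch/grid/lhcb/path/to/file.raw   # reports the staging status of the file, e.g. STAGED or STAGEIN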
Changing the Data manager
The need to swap the identity of the LHCb Data Manager arises more frequently than one would expect. This
procedure describes the steps to accomplish this operation smoothly.
File recovery
How to recover replicas that are lost even though SRM reports them as existing
This can happen: the file is physically lost, but SRM (lcg-ls) reports that the file is there; see this
GGUS
ticket. The replica below is completely lost from both tape and disk:
> lcg-ls -l srm://gridka-dCache.fzk.de/pnfs/gridka.de/lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw
-rw-r--r-- 1 2 2 3145768992 NEARLINE /pnfs/gridka.de/lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw
* Checksum: 51e2fc3d (adler32)
* Space tokens: 39930230
You then have to remove the lost replica and copy it over again from another site:
$ dirac-dms-remove-lfn-replica /lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw GRIDKA-RAW
$ dirac-dms-replicate-lfn /lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw GRIDKA-RAW
If there is only one replica and the corresponding file at the site has been lost completely, then you need to use dirac-dms-remove-files to remove the entry in the replica catalogue. You need to double check this is really the case, as this command will remove all replicas of the given file!
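Before removing anything, it is worth listing the replicas the catalogue actually knows about; a minimal check with the standard DIRAC command (using the LFN from the example above):
dirac-dms-lfn-replicas /lhcb/data/2011/RAW/FULL/LHCb/COLLISION11/98298/098298_0000000077.raw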
Changing the Default Protocols List for a given Site (Tier-1)
The order of the list of protocols supplied to SRM can be changed; e.g. testing root at NIKHEF means that root is prepended to the list. This guide explains
how to change the protocols list for a given site. This operation is restricted to those with the diracAdmin role. (Grid Experts)
Checking the throughput from the pit to Castor (during data taking)
The link bandwidth is 10 Gbit/s. The expected rate (as of the beginning of 2012) is about 280 MB/s (roughly 2.2 Gbit/s, well within the link capacity); some more details
here.
Site Availability Monitoring (SAM) tests
SAM tests are used by LHCb to check that the basic functionality of each Grid site is working. They run a few times each day at each site. The
SAM framework has been updated in DIRAC3 and now runs as a specialized workflow with tailored modules for each functional test. Running the
SAM tests is restricted to Grid Experts having the lcgadmin VOMS role.
Sites Troubleshooting
Investigating failed jobs
WLCG "conventional" site-support mailing list are available
here.
Daily Shifter Checklist
- Each day the shifter should routinely check the following items to ensure the smooth running of distributed computing for LHCb.
End of production Checklist
Miscellaneous
Feature Requests and Bug Reports
Bookkeeping System
Set the data quality flag
- If you know the data quality flag, you can use dirac-bookkeeping-setdataquality-run or dirac-bookkeeping-setdataquality-files. Running either command without input parameters shows the available quality flags and the usage, for example:
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[406]>dirac-bookkeeping-setdataquality-run
Available data quality flags:
UNCHECKED
OK
BAD
MAYBE
Usage: dirac-bookkeeping-setdataquality-run.py <RunNumber> <DataQualityFlag>
- The following commands can be used to set the data quality flag:
dirac-bookkeeping-setdataquality-run
The input parameters are the run number and the data quality flag. To see the available data quality flags, run the command without input parameters, as shown above.
The data quality flag is case sensitive. To set the data quality for a given run:
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[408]>dirac-bookkeeping-setdataquality-run 20716 'BAD'
Quality flag has been updated!
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[409]>
dirac-bookkeeping-setdataquality-files
The input is either a logical file name or a text file containing a list of LFNs. To set the quality flag for one file:
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[413]> dirac-bookkeeping-setdataquality-files /lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw 'BAD'
['/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw']
Quality flag updated!
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[414]>
To set the quality flag for a list of files:
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[416]> dirac-bookkeeping-setdataquality-files lfns.txt 'BAD'
['/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43998/043998_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43995/043995_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43994/043994_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43993/043993_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43992/043992_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43989/043989_0000000002.raw', '/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43987/043987_0000000002.raw']
Quality flag updated!
(DIRAC3-user) zmathe@pclhcb43 /scratch/zmathe/dirac[417]>
The file lfns.txt contains the following:
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/44026/044026_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43998/043998_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43995/043995_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43994/043994_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43993/043993_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43992/043992_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43989/043989_0000000002.raw
/lhcb/data/2009/RAW/EXPRESS/FEST/FEST/43987/043987_0000000002.raw
Documents