! Tier-0 Operator Manual
Document Description
This is intended to be a mostly complete reference of the procedures a Tier-0 operator will need to perform as part of their duties. Please write down any new procedure that comes up in day-to-day operations.
Error Recovery
Corrupted Unmerged files
Symptom
Paused merge jobs with a failure of type FileReadError and messages similar to:
An exception of category 'FileReadError' occurred while
[0] Reading branch EventAuxiliary
Exception Message:
Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
basket: has fNevBuf=0 but fEntryOffset=0, pos=1254725213, len=2179, fNbytes=0, fObjlen=0, trying to repair
Debugging
- Verify the integrity of the file -- TODO: Expand on this
Solution
Environment
LFN=/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root # Problematic unmerged file
WORKFLOW=Repack_Run208390_StreamA # Affected workflow
COUCHURL=vocms15.cern.ch:5984 # Couch server for the agent
- Find the job that produced the unmerged file by querying the following URL. Then extract the job ID and retry count from the id field of the response, which has the form JOBID-RETRYCOUNT.
http://$COUCHURL/wmagent_jobdump%2Ffwjrs/_design/FWJRDump/_view/jobsByOutputLFN?key=[%22$WORKFLOW%22,%20%22$LFN%22]
Example output:
{"total_rows":67735,"offset":37343,"rows":[
{"id":"249771-0","key":["Repack_Run208390_StreamA","/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root"],"value":249771}
]}
JOBID=249771
RETRYCOUNT=0
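Splitting the id field by hand is easy to get wrong when copying values between terminals. The following is a minimal sketch, assuming the id always has the form JOBID-RETRYCOUNT as in the example output above. DOC_ID is a name introduced here for illustration only.

```shell
# Hypothetical helper variable: the "id" value copied from the Couch response.
DOC_ID="249771-0"

# Everything before the last '-' is the job ID; everything after it is the retry count.
JOBID=${DOC_ID%-*}
RETRYCOUNT=${DOC_ID##*-}

echo "JOBID=$JOBID RETRYCOUNT=$RETRYCOUNT"
```

Both expansions are plain POSIX parameter expansion, so no extra tools are needed on the node.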
- Get the LogArch of the job that produced the unmerged file. Using the job ID and retry count from the previous step, query the following URL:
http://$COUCHURL/wmagent_jobdump%2Ffwjrs/_design/FWJRDump/_view/logArchivesByJobID?key=[$JOBID, $RETRYCOUNT]
Example output:
{"total_rows":12143,"offset":561,"rows":[
{"id":"249771-0","key":[249771,0],"value":{"lfn":"/store/t0temp/data/logs/prod/2012/12/2/Repack_Run208390_StreamA/Repack/0000/0/Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz","retrycount":0,"location":"T2_CH_CERN"}}
]}
LOGARCH=/store/t0temp/data/logs/prod/2012/12/2/Repack_Run208390_StreamA/Repack/0000/0/Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz
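The lfn can also be pulled out of a saved response instead of being copied by hand. This is a rough sketch, assuming the response looks like the example output above and that grep supports the -o option (GNU and BSD grep do). RESPONSE is a name introduced here for illustration.

```shell
# Sample response copied from the example output above; in practice this would
# be the saved body returned by the logArchivesByJobID view.
RESPONSE='{"total_rows":12143,"offset":561,"rows":[
{"id":"249771-0","key":[249771,0],"value":{"lfn":"/store/t0temp/data/logs/prod/2012/12/2/Repack_Run208390_StreamA/Repack/0000/0/Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz","retrycount":0,"location":"T2_CH_CERN"}}
]}'

# Take the first "lfn":"..." pair and keep only the value between the quotes.
LOGARCH=$(printf '%s\n' "$RESPONSE" | grep -o '"lfn":"[^"]*"' | head -n 1 | cut -d'"' -f4)

echo "LOGARCH=$LOGARCH"
```

If the view returns more than one row, only the first lfn is kept, so check the response by eye before relying on the result.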
- Stage the LogArch, unpack it, and locate PSet.py and PSet.pkl in the cmsRun1 directory, e.g.
lxplus310:example> cmsStage /store/t0temp/data/logs/prod/2012/12/2/Repack_Run208390_StreamA/Repack/0000/0/Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz .
lxplus310:example> tar -xzf Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz
lxplus310:example> ls
cmsRun1 Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz stageOut1
lxplus310:example> ls cmsRun1/
cmsRun1-stderr.log cmsRun1-stdout.log FrameworkJobReport.xml PSet.pkl PSet.py Report.pkl scramOutput.log
- Deploy the appropriate CMSSW area and run cmsRun on PSet.py. Note: you have to be cmsprod to reproduce Repack and Express jobs. For example:
lxplus310:example> scram project CMSSW_5_2_7
lxplus310:example> cd CMSSW_5_2_7/src/
lxplus310:src> cmsenv
lxplus310:src> cmsRun -e PSet.py &
- Pick the output file that corresponds to the file you want to replace. In this example the corrupted file is from the VBF1Parked PD, so we want write_VBF1Parked_RAW.root.
- Do basic content checks and compare against what the FWJR says about the original file:
FWJR information:
http://$COUCHURL/wmagent_jobdump%2Ffwjrs/$JOBID-$RETRYCOUNT
[{"branch_hash":"07d89c50739e492bf528d39b44aeec85","user_dn":null,"lfn":"/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root","dataset":{"applicationName":"cmsRun","applicationVersion":"CMSSW_5_2_7","processedDataset":"WMAT0Commissioning-v4","dataTier":"RAW","primaryDataset":"VBF1Parked"},"InputPFN":"/pool/lsf/cmsprod/337168096/job/WMTaskSpace/cmsRun1/write_VBF1Parked_RAW.root","checksums":{"adler32":"fcb9bea9","cksum":"1744879610"},"guid":"4CB0301F-343C-E211-B4A0-5404A63886A9","size":1953063303,"acquisitionEra":null,"configURL":"None;;None;;None","location":"T2_CH_CERN","async_dest":null,"events":4380,"validStatus":"PRODUCTION","ouput_module_class":"PoolOutputModule","globalTag":"NOTSET","custodialSite":null,"pfn":"/pool/lsf/cmsprod/337168096/job/WMTaskSpace/cmsRun1/write_VBF1Parked_RAW.root","catalog":"","module_label":"write_VBF1Parked_RAW","inputPath":null,"StageOutCommand":"rfcp-CERN","runs":{"208390":[206]},"OutputPFN":"root://eoscms//eos/cms/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root","user_vogroup":"DEFAULT","user_vorole":"DEFAULT","processingVer":null}]
See EVENTS=4380, RUNLUMIS={"208390":[206]}
Replacement file information:
edmFileUtil -e file://$PATHTOFILE/write_VBF1Parked_RAW.root
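The comparison can be partly scripted. The sketch below assumes the FWJR document has been saved as text and extracts the event count and run/lumi map from it. FWJR, NEW_EVENTS and RESULT are names introduced here, and NEW_EVENTS stands for the event count the operator reads from the edmFileUtil output; no edmFileUtil output format is assumed.

```shell
# Stand-in for the saved FWJR document, trimmed to the fields used here.
FWJR='[{"size":1953063303,"events":4380,"runs":{"208390":[206]}}]'

# Extract the event count and the run/lumi map recorded for the original file.
EVENTS=$(printf '%s\n' "$FWJR" | grep -o '"events":[0-9]*' | head -n 1 | cut -d: -f2)
RUNLUMIS=$(printf '%s\n' "$FWJR" | grep -o '"runs":{[^}]*}' | head -n 1 | cut -d: -f2-)

# Hypothetical value: the event count read from the edmFileUtil output
# for the replacement file.
NEW_EVENTS=4380

if [ "$EVENTS" = "$NEW_EVENTS" ]; then
    RESULT="match"
else
    RESULT="mismatch"
fi
echo "FWJR events=$EVENTS runs=$RUNLUMIS replacement events=$NEW_EVENTS: $RESULT"
```

The run and lumi list of the replacement file still has to be compared against RUNLUMIS by eye. Comparing the file size or checksums is deliberately left out, since a regenerated file is not guaranteed to be byte-identical to the original.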
- If the information checks out, replace the corrupted file with the new one:
xrdcp write_VBF1Parked_RAW.root root://eoscms//eos/cms/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root