Tier-0 Operator Manual

Document Description

This is intended to be a mostly complete reference of the procedures a Tier-0 operator will need to perform as part of their duties. Please write down any new procedure that comes up in day-to-day operations.

Error Recovery

Corrupted Unmerged files

Symptom

Merge jobs are paused with a failure of type FileReadError and a message similar to:

An exception of category 'FileReadError' occurred while
   [0] Reading branch EventAuxiliary
Exception Message:
Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
basket: has fNevBuf=0 but fEntryOffset=0, pos=1254725213, len=2179, fNbytes=0, fObjlen=0, trying to repair

Debugging

  1. Verify the integrity of the file -- TODO: Expand on this

Solution

Environment

LFN=/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root # Problematic unmerged file
WORKFLOW=Repack_Run208390_StreamA # Affected workflow
COUCHURL=vocms15.cern.ch:5984 # Couch server for the agent

  1. Find the job that produced the unmerged file by querying the following URL. Extract the job id and retry count from the id field in the response, which has the form <job id>-<retry count>.
http://$COUCHURL/wmagent_jobdump%2Ffwjrs/_design/FWJRDump/_view/jobsByOutputLFN?key=[%22$WORKFLOW%22,%20%22$LFN%22]
Example output:
{"total_rows":67735,"offset":37343,"rows":[
{"id":"249771-0","key":["Repack_Run208390_StreamA","/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root"],"value":249771}
]}

JOBID=249771
RETRYCOUNT=0
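
The lookup above can also be scripted. The following is a minimal Python sketch, assuming the view URL and the response shape shown in the example. The helper names build_jobs_by_lfn_url and split_job_id are hypothetical and are not part of WMAgent. The sketch only builds the URL and parses a response that has already been fetched; it does not contact the Couch server.

```python
import json
from urllib.parse import quote

WORKFLOW = "Repack_Run208390_StreamA"
LFN = ("/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/"
       "00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root")
COUCHURL = "vocms15.cern.ch:5984"


def build_jobs_by_lfn_url(couch_url, workflow, lfn):
    """Build the jobsByOutputLFN view URL (hypothetical helper)."""
    # The view key is the JSON array [workflow, lfn], percent-encoded
    # so that it is safe to use in a query string.
    key = quote(json.dumps([workflow, lfn]))
    return (f"http://{couch_url}/wmagent_jobdump%2Ffwjrs/_design/FWJRDump/"
            f"_view/jobsByOutputLFN?key={key}")


def split_job_id(response_text):
    """Return (job id, retry count) from the first row of the view response."""
    rows = json.loads(response_text)["rows"]
    if not rows:
        raise ValueError("no job found for this workflow and LFN")
    # The id field has the form "<job id>-<retry count>", e.g. "249771-0".
    job_id, retry_count = rows[0]["id"].rsplit("-", 1)
    return int(job_id), int(retry_count)


# Response copied from the example output above.
example_response = """{"total_rows":67735,"offset":37343,"rows":[
{"id":"249771-0","key":["Repack_Run208390_StreamA","/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root"],"value":249771}
]}"""

url = build_jobs_by_lfn_url(COUCHURL, WORKFLOW, LFN)
job_id, retry_count = split_job_id(example_response)
print(url)
print(job_id, retry_count)  # prints 249771 0
```

If more than one row is returned, this sketch only looks at the first one, so check the full response by hand in that case.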

  1. Get the logArchive of the job that produced the unmerged file. Using the job id and retry count from the previous step, go to the following URL:
http://$COUCHURL/wmagent_jobdump%2Ffwjrs/_design/FWJRDump/_view/logArchivesByJobID?key=[$JOBID, $RETRYCOUNT]

Example output:

{"total_rows":12143,"offset":561,"rows":[
{"id":"249771-0","key":[249771,0],"value":{"lfn":"/store/t0temp/data/logs/prod/2012/12/2/Repack_Run208390_StreamA/Repack/0000/0/Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz","retrycount":0,"location":"T2_CH_CERN"}}
]}
LOGARCH=/store/t0temp/data/logs/prod/2012/12/2/Repack_Run208390_StreamA/Repack/0000/0/Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz
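
As with the previous step, this lookup can be sketched in Python. The sketch assumes the view URL and the response shape shown in the example; build_logarchive_url and extract_logarchive_lfn are hypothetical helper names. It parses a response that has already been fetched and makes no network call.

```python
import json
from urllib.parse import quote


def build_logarchive_url(couch_url, job_id, retry_count):
    """Build the logArchivesByJobID view URL (hypothetical helper)."""
    # The view key is the JSON array [job id, retry count], percent-encoded.
    key = quote(json.dumps([job_id, retry_count]))
    return (f"http://{couch_url}/wmagent_jobdump%2Ffwjrs/_design/FWJRDump/"
            f"_view/logArchivesByJobID?key={key}")


def extract_logarchive_lfn(response_text, retry_count):
    """Return the logArchive LFN of the row matching the retry count."""
    for row in json.loads(response_text)["rows"]:
        if row["value"]["retrycount"] == retry_count:
            return row["value"]["lfn"]
    raise ValueError("no logArchive found for this job and retry count")


# Response copied from the example output above.
example_response = """{"total_rows":12143,"offset":561,"rows":[
{"id":"249771-0","key":[249771,0],"value":{"lfn":"/store/t0temp/data/logs/prod/2012/12/2/Repack_Run208390_StreamA/Repack/0000/0/Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz","retrycount":0,"location":"T2_CH_CERN"}}
]}"""

logarch_url = build_logarchive_url("vocms15.cern.ch:5984", 249771, 0)
logarch = extract_logarchive_lfn(example_response, 0)
print(logarch_url)
print(logarch)
```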

  1. Stage the logArchive and extract it to get the PSet.py and PSet.pkl, e.g.

lxplus310:example> cmsStage /store/t0temp/data/logs/prod/2012/12/2/Repack_Run208390_StreamA/Repack/0000/0/Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz .
lxplus310:example> tar -xzf Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz 
lxplus310:example> ls
cmsRun1  Repack-ffc1c5de-3c33-11e2-844d-003048caaace-0-logArchive.tar.gz  stageOut1
lxplus310:example> ls cmsRun1/
cmsRun1-stderr.log  cmsRun1-stdout.log   FrameworkJobReport.xml   PSet.pkl  PSet.py  Report.pkl  scramOutput.log

  1. Set up the appropriate CMSSW area and run cmsRun on the PSet.py. Note: you have to be cmsprod to reproduce Repack and Express jobs. For example:

lxplus310:example> scram project CMSSW_5_2_7
lxplus310:example> cd CMSSW_5_2_7/src/
lxplus310:src> cmsenv
lxplus310:src> cmsRun -e PSet.py &

  1. Pick the output file that corresponds to the file you want to replace. In this example the file is from the VBF1Parked PD, so we want write_VBF1Parked_RAW.root.

  1. Do basic content checks and compare against what the FWJR says about the original file:
FWJR information:
http://$COUCHURL/wmagent_jobdump%2Ffwjrs/$JOBID-$RETRYCOUNT
[{"branch_hash":"07d89c50739e492bf528d39b44aeec85","user_dn":null,"lfn":"/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root","dataset":{"applicationName":"cmsRun","applicationVersion":"CMSSW_5_2_7","processedDataset":"WMAT0Commissioning-v4","dataTier":"RAW","primaryDataset":"VBF1Parked"},"InputPFN":"/pool/lsf/cmsprod/337168096/job/WMTaskSpace/cmsRun1/write_VBF1Parked_RAW.root","checksums":{"adler32":"fcb9bea9","cksum":"1744879610"},"guid":"4CB0301F-343C-E211-B4A0-5404A63886A9","size":1953063303,"acquisitionEra":null,"configURL":"None;;None;;None","location":"T2_CH_CERN","async_dest":null,"events":4380,"validStatus":"PRODUCTION","ouput_module_class":"PoolOutputModule","globalTag":"NOTSET","custodialSite":null,"pfn":"/pool/lsf/cmsprod/337168096/job/WMTaskSpace/cmsRun1/write_VBF1Parked_RAW.root","catalog":"","module_label":"write_VBF1Parked_RAW","inputPath":null,"StageOutCommand":"rfcp-CERN","runs":{"208390":[206]},"OutputPFN":"root://eoscms//eos/cms/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root","user_vogroup":"DEFAULT","user_vorole":"DEFAULT","processingVer":null}]

From the FWJR, note EVENTS=4380 and RUNLUMIS={"208390":[206]}.

Replacement file information:
edmFileUtil -e file://$PATHTOFILE/write_VBF1Parked_RAW.root 
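
The comparison against the FWJR can be sketched in Python as follows. This is only an illustration under stated assumptions: the FWJR file entries are assumed to be available as a JSON list like the one shown above (trimmed here to the fields used), and the event count and run/lumi list of the replacement file are typed in by hand after reading the edmFileUtil output. The sketch does not parse edmFileUtil output, and the names fwjr_file_summary and matches_original are hypothetical.

```python
import json

LFN = ("/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/"
       "00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root")

# FWJR entry from the example above, trimmed to the fields this sketch uses.
fwjr_files_text = json.dumps([{"lfn": LFN, "events": 4380, "runs": {"208390": [206]}}])


def fwjr_file_summary(files_text, lfn):
    """Return (events, runs) recorded in the FWJR for the given LFN."""
    for entry in json.loads(files_text):
        if entry["lfn"] == lfn:
            return entry["events"], entry["runs"]
    raise KeyError(lfn)


def normalize_runs(runs):
    """Use string run numbers and sorted lumi lists so both sides compare equal."""
    return {str(run): sorted(lumis) for run, lumis in runs.items()}


def matches_original(expected, observed_events, observed_runs):
    """True if the replacement file has the same event count and run/lumis."""
    expected_events, expected_runs = expected
    return (expected_events == observed_events
            and normalize_runs(expected_runs) == normalize_runs(observed_runs))


expected = fwjr_file_summary(fwjr_files_text, LFN)
# Values read by hand from the edmFileUtil output for the replacement file.
print(matches_original(expected, 4380, {208390: [206]}))  # prints True
```

Only the event count and the run/lumi content are compared here; whether other fields such as the checksums or the size should match for a reproduced file is not stated in this manual.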

  1. If the information checks out, replace the original file with the new one:
xrdcp write_VBF1Parked_RAW.root root://eoscms//eos/cms/store/t0temp/data/WMAT0Commissioning/VBF1Parked/RAW/v4/000/208/390/00000/4CB0301F-343C-E211-B4A0-5404A63886A9.root

Topic revision: r2 - 2012-12-14 - DiegoBallesterosVillamizar