https://indico.cern.ch/conferenceDisplay.py?confId=254683
Attendance
Personnel
Nov 26 --> Dec 3 Sunil
Dec 3 --> Dec 10 Sara
Review of last week's issues
- 2011 legacy reprocessing:
 - Recovery very close to being complete; on the last workflow, waiting for the recovery to finish. Jobs were still running as of Monday night.
- Where are we with the stuck workflow problems?
 - Problem with log file access, especially when there is very little time between jobs finishing and attempts to access the logfiles.
 - Suggestion: first store the logarchives on EOS and then on Castor.
 - How do we give users access to these files? Will they be able to fetch them themselves, or will the workflow team have to fetch them?
Agent Issues
Site Issues that affected workflows
Workflow Issues
- Drop of running jobs on Tuesday-Wednesday: GlideIn front-end.
- Retrieving logs before force-complete.
MonteCarlo
- EXO-Fall13 with merge failures (11474); kill-and-clone policy.
- BTV-Fall13 batch without failure info (11508).
- High-priority WFs finished: 11428.
- Highest-priority WF stuck: 11489. ACDC completed successfully, but the workflow is still at 34%.
Reprocessing
- franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576 - still running recovery; the rest of the ReReco WFs are done.
- pdmvserv_SMP-Fall11R4-00002_T1_US_FNAL_MSS_00006_v0__131113_125318_6858
 - Monday's round of tests with Burt showed it is in fact a site issue; Burt is working on it.
- pdmvserv_EGM-UpgradePhase1Age1Kdr61SLHCx-00002_T1_US_FNAL_MSS_00002_v0__131125_160358_1243 - performance failures, 92% complete, so I will run an ACDC to see if we can recover more.
Andrew's Question
--
JenniferAdelmanMcCarthy - 03 Dec 2013