https://indico.cern.ch/conferenceDisplay.py?confId=254683
Attendance
- John, SeangChan, Luis, Jen, Dave - FNAL
- Julian, Andrew, Adli - CERN
Personnel
- Nov 26 --> Dec 3 Sunil
- Dec 3 --> Dec 10 Sara
Review of last week's issues
- 2011 legacy reprocessing:
- Recovery is very close to complete; we are on the last workflow and waiting for its recovery to finish. Jobs were still running as of Monday night.
- Where are we with the stuck Workflow problems?
- 40 workflows were stuck yesterday, 15 today.
- Luis restarted some stuck components, including the DBS component.
- Problem with Couch in the MC agents: there is a TaskArchiver query that takes 24 hrs and then crashes, so it has not completed.
- When can we shut down the agent and hand it over to Dirk? We will give Dirk 216 the next time it crashes.
- Problem with log file access, especially when there is very little time between a job finishing and the attempt to access its log files.
- Suggestion to first store the logArchives on EOS and then on Castor; the physical location would not change, so we can keep track of where they are, for 1 week to 1 month. Dave is following up on this issue.
- How do we give users access to these files? Will they be able to fetch them themselves, or will the workflow team have to fetch them? (A fetch sketch follows this list.)
- T2_CH_CERN_HLT & AI - most likely a site issue; John will look into it. The WFs stuck due to this issue are all running on 227, which does not read from the drain list, so the fact that the site is in drain is not the problem. It is an actual site issue.
- low latency agent
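As a follow-up to the log-access question above, here is a minimal sketch of how a user or the workflow team could fetch a logArchive themselves, assuming the tarballs are kept at the same LFN on both EOS and Castor as suggested; the example LFN, hostnames, and helper function are illustrative, not an agreed layout.

    # Minimal sketch: try the EOS copy first, then fall back to Castor.
    # The LFN below is a hypothetical example, not a real logArchive path.
    import subprocess

    LFN = "/store/logs/prod/2013/12/SomeWorkflow/SomeWorkflow-LogCollect-logs.tar"

    def fetch_logarchive(lfn, dest="."):
        """Copy a logArchive locally, preferring the short-lived EOS copy."""
        sources = [
            "root://eoscms.cern.ch//eos/cms" + lfn,               # EOS (kept ~1 week to 1 month)
            "root://castorcms.cern.ch//castor/cern.ch/cms" + lfn, # Castor (tape-backed)
        ]
        for url in sources:
            if subprocess.call(["xrdcp", "-f", url, dest]) == 0:
                return url
        raise RuntimeError("logArchive not found on EOS or Castor: %s" % lfn)

    if __name__ == "__main__":
        print("Fetched from", fetch_logarchive(LFN))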
Agent Issues
Site Issues that affected workflows
- T2_UK_Bristol - giving warnings for some jobs on the production view of the dashboard. The CREAM CE is in an unscheduled downtime; John will see how long they are down and then decide whether the site needs to go into drain. They should be out of downtime in a couple of hours.
Workflow Issues
- Drop in running jobs on Tuesday-Wednesday: GlideIn front-end.
- Retrieving logs before force-complete.
MonteCarlo
- EXO-Fall13 with merge failures (11474); kill-and-clone policy.
- BTV-Fall13 batch without failure info (11508).
- High priority WFs finished: 11428
- Highest priority WF stuck: 11489; the ACDC completed successfully but the WF is still at 34%.
Reprocessing
- franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576 - still running recovery; the rest of the ReReco WFs are done.
- pdmvserv_SMP-Fall11R4-00002_T1_US_FNAL_MSS_00006_v0__131113_125318_6858 - Monday's round of tests with Burt showed it is in fact a site issue, and Burt is working on it. Still tracking down the issue, so I will run another round of tests.
- pdmvserv_EGM-UpgradePhase1Age1Kdr61SLHCx-00002_T1_US_FNAL_MSS_00002_v0__131125_160358_1243 - performance failures; it is at 92%, so I will run an ACDC to see if we get more.
Andrew's Questions
- Requesters asked what sets the memory (RSS) limit of workflows, in particular for heavy ion.
- We do have sites where we could bump it up a little; the limit is set by the hardware limits of most sites.
- They want 4 GB; I don't think we have any sites with that kind of limit (see the sketch at the end of this section).
- RE priority of agents
- We opened a ticket for the PhEDEx injector problems (400 error): https://github.com/dmwm/WMCore/issues/4863
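To make the memory discussion above concrete, here is a minimal sketch of the mismatch between a 4 GB request and typical site slot limits; the "Memory" field name (in MB) and the per-site numbers are assumptions for the illustration, not actual production configuration.

    # Illustrative only: which sites could run a job with the requested RSS cap?
    SITE_SLOT_MEMORY_MB = {      # hypothetical per-site batch-slot limits
        "T1_US_FNAL": 2500,
        "T2_US_Wisconsin": 3000,
    }

    request = {
        "RequestName": "HIN-Fall13-example",  # hypothetical heavy-ion request
        "Memory": 4000,                       # requesters would like ~4 GB RSS
    }

    def sites_that_fit(request, site_limits):
        """Return the sites whose slot memory can accommodate the request."""
        return [site for site, limit in site_limits.items()
                if limit >= request["Memory"]]

    print(sites_that_fit(request, SITE_SLOT_MEMORY_MB))  # [] -> no site offers 4 GB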
--
JenniferAdelmanMcCarthy - 03 Dec 2013