Workflow Team Meeting - May 29
Attending
- FNAL - Jen, Luis, Dave, Oli, SeangChan
- CERN - Julian
Personnel
May 22 -> May 29 | Xavier
May 29 -> June 5 | Adli
News
- ACDC couch views need to be rebuilt
- projected for this weekend still?
- what are possible implications? What should we look out for?
- We cannot create ACDCs while the rebuild is running.
- SeangChan will delete the old ACDCs whose original WF has been announced and see how much that buys us, and then they will recreate the view.
- Need people to look into timeouts of merge jobs. Merge jobs should be short, so there is no reason for them to time out. We have been blaming "network issues", but has anything else changed over the last couple of weeks that could be causing this problem? Jen is seeing it a LOT in Redigi, and Luis is noting issues in StoreResults as well.
Site issues
- Who put Caltech in drain? When you put a site in drain you must e-log it and file tickets. They have 4k idle nodes and nobody knows why they are in drain.
Xavier / Sara's Notes
Agent Issues
- 201 and 85 still in drain for upgrades - how are we doing in updating our documentation on drain issues?
- ErrorHandler crashing a lot, hence the need for the ACDC view rebuild.
Workflow issues
Store Results
- Jen, Julian and Luis had a meeting last Friday to discuss handover of Store Results. We will have another meeting Fri May 30:
- Turns out that Store Results is having the same issue with merge timeouts as Redigi is. Luis reported that WF's he ran with no issues several weeks ago are now having timeout issues, and was going to investigate further. Luis, do you have an update?
- Recovering a lot of old workflows, some of them are really huge (100K jobs or more) and last a while.
- I need to be able to extend workflows, who can help me debug this? https://github.com/dmwm/WMCore/issues/5148
Redigi/Rereco
- Working my way through the list of WF's in complete. Most of them are due to timeouts. At FNAL we were blaming the timeouts on network issues, but I am seeing them across the board. We need to figure this out; it's killing us in latency to have to make 2-4 ACDCs per workflow to get everything through.
- RelVal workflow assignment - Andrew's page
- FWIW I agree with Dave, this sounds dangerous. We all know requestors can put "stupid things in" that could really break things badly. Having a bit of a buffer in there may slow things down, but fast is not always best.
- what if we have an approval requirement, like we do for transfer requests?
- Dave, please do not move relval data
- all logcollect jobs are still failing at FNAL: https://cmslogbook.cern.ch/elog/Workflow+processing/14030 and https://github.com/dmwm/WMCore/issues/5076
--
JenniferAdelmanMcCarthy - 28 May 2014
Topic revision: r5 - 2014-05-29 - JenniferAdelmanMcCarthy