Workflow Team Meeting - May 29

Vidyo Link

Attending

  • FNAL - Jen, Luis, Dave, Oli, SeangChan
  • CERN - Julian

Personnel

May 22 -> May 29 Xavier
May 29 -> June 5 Adli

News

  • ACDC couch views need to be rebuilt
    • Still projected for this weekend?
      • Friday - maybe
    • What are the possible implications? What should we look out for?
      • We cannot create ACDCs while the rebuild is running.
    • SeangChan will delete the old ACDCs whose original WF has been announced, see how much that buys us, and then the view will be recreated.
  • Need people to look into timeouts of merge jobs. Merge jobs should be short, so there is no reason for them to time out. We have been blaming "network issues", but has anything else changed over the last couple of weeks that could be causing this problem? Jen is seeing it a lot in Redigi, and Luis is noting issues in StoreResults as well.

Site issues

  • Who put Caltech in drain? When you put a site in drain you must e-log it and file tickets. They have 4k idle nodes and nobody knows why they are in drain.

Xavier / Sara's Notes

Agent Issues

  • 201 and 85 still in drain for upgrades - how are we doing on updating our documentation on drain issues?
  • ErrorHandler is crashing a lot, hence the need for the ACDC view rebuild.

Workflow issues

Store Results

  • Jen, Julian and Luis had a meeting last Friday to discuss the handover of Store Results. We will have another meeting on Fri May 30.
  • It turns out that Store Results is having the same issue with merge timeouts as Redigi. Luis reported that WFs he ran with no issues several weeks ago are now having timeout issues, and he was going to investigate further. Luis, do you have an update?

MonteCarlo

  • Recovering a lot of old workflows. Some of them are really huge (100K jobs or more) and take a while.
  • I need to be able to extend workflows. Who can help me debug this? https://github.com/dmwm/WMCore/issues/5148

Redigi/Rereco

  • Working my way through the list of WFs in complete. Most of them are due to timeouts. At FNAL we were blaming the timeouts on network issues, but I am seeing them across the board. We need to figure this out; it's killing us in latency to have to make 2-4 ACDCs per workflow to get everything through.

RelVal

  • RelVal workflow assignment - Andrew's page
    • FWIW I agree with Dave, this sounds dangerous. We all know requestors can put "stupid things in" that could really break things badly. Having a bit of a buffer in there may slow things down, but fast is not always best.
      • What if we had an approval requirement, like we do for transfer requests?
  • Dave, please do not move relval data
  • all logcollect jobs are still failing at FNAL: https://cmslogbook.cern.ch/elog/Workflow+processing/14030 and https://github.com/dmwm/WMCore/issues/5076
-- JenniferAdelmanMcCarthy - 28 May 2014
Topic revision: r5 - 2014-05-29 - JenniferAdelmanMcCarthy
 