Workflow Team Meeting - May 29

Vidyo Link

Attending

  • FNAL - Jen, Luis, Dave, Oli, SeangChan, John
  • CERN - Julian, Andrew L

Personnel

May 22 -> May 29 Xavier
May 29 -> June 5 Adli

News

  • ACDC couch views need to be rebuilt
    • projected for this weekend still?
      • Friday, maybe. We will be informed when this happens and when we are clear again.
    • what are possible implications? What should we look out for?
      • We cannot create ACDC's while the rebuild is running.
    • SeangChan will delete the old ACDC's whose original WF has been announced, and see how much that buys us and then they will recreate the view
  • Need people to look into timeouts of merge jobs. Merge jobs should be short, so there is no reason for them to time out. We have been blaming "network issues", but has anything else changed over the last couple of weeks that could be causing this problem? Jen is seeing it a LOT in Redigi and Luis is noting issues in StoreResults as well.

Site issues

  • Who put Caltech in drain? When you put a site in drain you must e-log and file tickets. They have 4k idle nodes and nobody knows why they are in drain.
  • Drain list, sites ready to move out:
    • Caltech
    • ASGC
    • T2_Belgium_UCL
    

Xavier / Sara's Notes

Agent Issues

  • 201 and 85 still in drain for upgrades - how are we doing in updating our documentation on drain issues?
    • Guys I need a green light to redeploy vocms201 and vocms85: https://cmslogbook.cern.ch/elog/Workflow+processing/14770
    • I will redeploy it on Friday unless someone tells me not to. --> Workflows with missing information. --> More work for everyone.
    • SeangChan would like to move to couch 1.5 for better stability. Fixes Cert problem as well.
  • ErrorHandler is crashing a lot, hence the need for the ACDC view rebuild.

Workflow issues

Store Results

  • Jen, Julian and Luis had a meeting last Friday to discuss handover of Store Results. We will have another meeting Fri May 30, 5 CERN time.
  • Turns out that Store Results is having the same issue with merge timeouts as Redigi is. Luis reported that WF's he ran with no issues several weeks ago are now having timeout issues, and was going to investigate further. Luis, do you have an update?

MonteCarlo

  • Recovering a lot of old workflows, some of them are really huge (100K jobs or more) and last a while.
  • I need to be able to extend workflows, who can help me debug this? https://github.com/dmwm/WMCore/issues/5148

Redigi/Rereco

  • Working my way through the list of WF's in complete. Most of them are due to timeouts. At FNAL we were blaming the timeouts on network issues, but I am seeing them across the board. We need to figure this out; it's killing us in latency to have to make 2-4 ACDC's per workflow to get everything through.
    • Dave will post to Comp-ops to have the other T1's look at their network issues

RelVal

  • RelVal workflow assignment - Andrew's page
    • FWIW I'm agreeing with Dave, this sounds dangerous. We all know requestors can put "stupid things in" that could really break things badly. Having a bit of a buffer in there may slow things down, but fast is not always best.
      • what if we have an approval requirement, like we do for transfer requests?
    • agreed that this isn't going to work.
  • Dave, please do not move relval data
  • all logcollect jobs are still failing at FNAL: https://cmslogbook.cern.ch/elog/Workflow+processing/14030 and https://github.com/dmwm/WMCore/issues/5076
    • patch is in github but it hasn't been applied in the agent yet.
    • patches need to be applied to the relval agent. Julian will patch them tomorrow.

AOB

  • closeout procedure and what are we doing with the MSS subscription
    • Now that we are saving more stuff at CERN it is increasing our latency
    • change the code so that the subscription has been made - Julian will change the code

  • SL6 - where do we stand? So far we've made the RPM's so we can deploy at CERN and FNAL machines, dependencies have been solved, and Krista is getting us a machine.
    • Get a bunch of SL6 machines, give them a team of "global" and start moving to them over the course of the next few months. Most of the worker nodes are SL6, we just need to move the agents. We need to get Condor functioning at FNAL. At CERN they need to connect to the condor pool.
    • SeangChan will attend the Burt and Oli show today so we can discuss it with Krista.

-- JenniferAdelmanMcCarthy - 28 May 2014

Topic revision: r7 - 2014-05-29 - JenniferAdelmanMcCarthy
 