Workflow Team Meeting - April 30, 4PM CERN, 9AM FNAL time

Vidyo Link

Attending

  • FNAL: Jen, Matteo, SeangChan
  • US: Ian, Ajit
  • CERN: Andrew, Julian, Alan, Dima
  • EU:

Personnel

  • Luis will be in Colombia around May 1-15, working remotely (very little)
    • John will be in charge of the T0 while Luis is gone
  • Jorge is in Colombia
  • Matteo will be at CERN May 4-14
  • Jen will be taking a 1/2 day on May 1, and will not be available on the weekends of May 2-3 and May 16-17
  • May 1 is an EU holiday

News - Dima

  • TP workflows are running late
    • lots of errors from exceeding maxRSS, even with event splitting
    • new GenSim requests should be injected any time now, and they will be used to redo DigiReco
      • ~1/2 of the requests will be rerun from the start
      • For workflows still running, let them complete on their own, then close them out manually
  • we have a bunch of workflows that are not closed; force-complete them if they are at 90% or higher
    • specific requests by users: if they are more than 3 weeks old, we force-complete
    • Dima is working on making 90% the new threshold
  • Major Run2 DigiReco is delayed by ~2 weeks
    • Current estimate is May 8th
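
The force-complete rule discussed above can be sketched roughly as follows. This is only an illustration of one reading of the notes (at or above 90%, or a specific user request more than 3 weeks old); the function name, arguments, and constants are hypothetical and not taken from the closeout scripts.

```python
from datetime import date, timedelta

# Hypothetical values taken from the notes above; not the production configuration.
COMPLETION_THRESHOLD = 0.90   # "force-complete if 90% or higher"
MAX_AGE = timedelta(weeks=3)  # "more than 3 weeks old"


def should_force_complete(completed_fraction, injected_on, today, user_requested=False):
    """Return True if a workflow that is not closed qualifies for force-completion.

    Assumed reading of the notes: a workflow qualifies when it is at or above
    the completion threshold, or when users specifically asked for it and it
    has been in the system for more than three weeks.
    """
    if completed_fraction >= COMPLETION_THRESHOLD:
        return True
    return user_requested and (today - injected_on) > MAX_AGE


today = date(2015, 4, 30)
print(should_force_complete(0.93, date(2015, 4, 20), today))                       # True
print(should_force_complete(0.70, date(2015, 4, 1), today, user_requested=True))   # True
print(should_force_complete(0.70, date(2015, 4, 20), today, user_requested=True))  # False
```

Keeping the threshold in one constant would make a change such as "90% as the new threshold" a one-line edit.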

3 top issues affecting production

  • Patch to the agent due to instability issues, only on submit2: JobUpdater can't connect to a socket
    • Alan is discussing this with Furruk and Brian. We have too many TCP connections, and when we hit that limit we see the failure.
    • It may be an issue because couch is making more connections and leaving them open. It only happens to JobUpdater and only on submit2.
    • If this is true, we can restrict the number of connections on the couch side.
    • SeangChan added a patch to retry. If it goes down during FNAL time, ping SeangChan so he can take a look.
  • now that high priority stuff is wrapping up we are getting good cycles on low priority workflows
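
For reference, a retry around a failing socket connection, in the spirit of the retry patch mentioned above, might look like the sketch below. This is not the actual agent patch; `call_with_retry`, its defaults, and the `flaky_update` stand-in are all assumptions made for illustration.

```python
import time


def call_with_retry(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation(), retrying on connection-level errors with a growing delay.

    All names and defaults here are illustrative assumptions. OSError covers
    socket errors and ConnectionError subclasses in Python 3.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            if attempt == attempts:
                raise  # out of retries: let the caller see the failure
            print("attempt %d failed (%s), retrying" % (attempt, exc))
            sleep(delay_seconds * attempt)


# Stand-in for a component update that fails twice before succeeding.
calls = {"count": 0}


def flaky_update():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionRefusedError("cannot connect to socket")
    return "updated"


# The sleep is replaced by a no-op so the example runs instantly.
print(call_with_retry(flaky_update, sleep=lambda seconds: None))  # updated
```

A retry only hides the symptom; if the connection limit is the real cause, restricting the number of connections on the couch side would still be needed.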

Site support - John

  • it's quiet so we aren't breaking things

Workflows

  • New close out script and pages:
    • https://cmst2.web.cern.ch/cmst2/unified/assistance.html
    • https://cmst2.web.cern.ch/cmst2/unified/closeout.html
    • issues to discuss:
      • # files in dbs vs PhEDEx (dbsF dbsIF phdF): are we now accounting for invalidated files automatically?
        • Julian, have you checked to see how JeanRoch is doing this? We suddenly seem to have had more issues with this during this week; I hadn't had failures for this in a long time.
        • JR : using the code from the old closeout script, with the invalid file filter in addition
        • Julian: Yes I have, invalid files are caused by other site-related issues.
        • Julian: Idea -> integrate issues pages with the db that is already in place (I can do that)
      • Are we cascading closeouts properly? I just looked and there are a fair number of ACDCs "abandoned" in complete where the parent workflow has moved on.
      • I'm still fuzzy on the differences between assistance and assistance-recovery
        • JR :
          • assistance is everything that is going to solve itself (missing file in dbs, phedex, acdc finishing, mss copy, ...) with time and automatically
          • assistance-recovery means that, regardless of the other checks performed, the available complete fraction is below threshold. We could set those in this situation to assistance when an acdc is already running. Would that be less confusing ?
      • which version of WorkflowPercentage.py is correct for event based splitting workflows? Do we understand? I don't want to repeat the mistakes of the past
        • JR : what mistake of the past ?
        • the lumi fraction was high but the event fraction was low. You can end up with slow lumis, so the processable ones are the ones that end up in the output dataset, and we don't want to bias our outputs. We are mostly processing small lumis. We need to warn people about this when we announce. We don't want to do event-based lumis as a default.
  • JR : Lots of RunIISpring15Digi74 got in the game, progressing ok so far
  • JR : "lfn" for /GEN dataset. what should it be, why ?
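
To illustrate the lumi fraction vs event fraction point raised above, here is a small sketch with made-up numbers. It is not WorkflowPercentage.py; the function and the lumi sizes are hypothetical and only show how the two fractions can diverge when the lumis that finish are the small ones.

```python
def completion_fractions(lumis, processed):
    """Return (lumi_fraction, event_fraction).

    lumis:     dict mapping lumi number -> number of events in that lumi
    processed: set of lumi numbers that made it into the output dataset
    """
    done = processed & set(lumis)
    total_events = sum(lumis.values())
    done_events = sum(lumis[lumi] for lumi in done)
    return len(done) / len(lumis), done_events / total_events


# Made-up numbers: nine small lumis finished, one large lumi is still missing.
lumis = {n: 100 for n in range(1, 10)}
lumis[10] = 9100

lumi_fraction, event_fraction = completion_fractions(lumis, set(range(1, 10)))
print("lumi fraction:  %.2f" % lumi_fraction)   # 0.90
print("event fraction: %.2f" % event_fraction)  # 0.09
```

With these numbers a check on the lumi fraction alone would pass a 90% threshold while only 9% of the events are in the output.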

ReDigi

TaskChains

  • Dave was complaining that they are doing something funny that can't be run at SanDiego. They are 2-step wf's where the 1st step has to run before the 2nd, so we need to be able to stage out to disk, which is the issue at SanDiego. A TaskChain that has only one step could run there.
  • we can use the SD storage element, so we could use it like a T1; we just need to work on it now.
  • Can we send MC clones and resubmit there? yes
  • We should send as much pure gen MC at them as we can
  • 3 task_B2G-RunIIWinter15wmLHE with maxRSS exceeded (only 1 job) https://cms-logbook.cern.ch/elog/Workflow+processing/20035

Redigi

  • things slowing down, and old low priority workflows in the system are finally getting cycles.
  • High Priority Upgrade and TP workflows are closing out with "good enough" statistics

miniaod's

  • nothing

Rereco

  • JR : Who can take care of the on-going cosmic reprocessing that somehow got lost in transition ?
    • Jen will run stuck requests script on them and then elog and find out from Andrew and Alan what they want us to do with them.

Store Results

  • nothing to report

MonteCarlo

  • JR : new upgrade gen-sim coming in, setting splitting to 25 events/lumi automatically
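
As a quick sanity check of what 25 events/lumi means for a request, the arithmetic can be sketched as below. The helper name and the example request sizes are made up; only the value 25 comes from the notes.

```python
EVENTS_PER_LUMI = 25  # value quoted in the notes above


def lumis_needed(total_events, events_per_lumi=EVENTS_PER_LUMI):
    """Number of lumis needed to hold total_events (the last one may be partial)."""
    return (total_events + events_per_lumi - 1) // events_per_lumi


print(lumis_needed(1000000))  # 40000
print(lumis_needed(1010))     # 41
```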

Agent Issues

Redeployment Plan

RelVal Andrew

  • What is the status of the github issues
    • separation of patch: ran tests, it's OK. Subscription status changes: when it is done, only that subscription is done, not all the child subscriptions. We can't take out the patch or it will mess up the database. Need to test first in testbed and then on the test agents.
    • about the maxmean value: what should we do? The patch probably won't get in

L3 discussion - Ajit, Jean-Roch, Matteo

  • JR : How does the "Memory" parameter get passed along ?
  • JR : How do we catch stuck requests easily ?
  • JR : Should we abandon the redigi, mc separation and talk about workflows to be done. technically there is no separation anymore.
  • From Dave: Long tails: are they caused by the wider whitelist? Is it an issue? Again on the wider whitelist: if everything is running everywhere and one site is affected by some issue, this will create problems for a large number of workflows, the ones running there. How can we deal with this?
  • Matteo: Is it possible to use the T0 for processing?

Opportunistic Resources - Stefan

HLT

  • Matteo: did we succeed in running MC/ReDigi? Is the EOS permission issue solved?

SDSC

  • JR : 1Mevent/day observed, and going "down". is anything happening with the site ?

Automatic Assignment And Unified Software

AOB

Last Agent status

production SL6

FNAL                                   CERN
cmsgwms-submit1 (ready to redeploy)    vocms0308 (ready to wake)
cmsgwms-submit2 (up)                   vocms0309 (up)
cmssrv217 (ready to redeploy)          vocms0310 (up)
cmssrv218 (ready to redeploy)          vocms0311 (up)
cmssrv219 (ready to redeploy)

-- JenniferAdelmanMcCarthy - 2015-04-29

Topic revision: r10 - 2015-04-30 - JenniferAdelmanMcCarthy
 