Workflow Team Meeting - Dec 17, 4PM CERN, 9AM FNAL time

Vidyo Link

Attending

  • FNAL: Jen, SeangChan, Gaston, Eliana
  • US: Ajit, Matteo
  • CERN: Dima, Jean-Roch
  • Colombia: Julian

Personnel

  • Julian: in Colombia from Dec 14, working remotely from there until Jan 1; his contract then ends
    • Replacement coming late Jan? What do we know? Sounds like they will be at CERN in early Feb.
  • Jen: working remotely Dec 24-Jan 4
  • Gaston: out Dec 22-Jan 8 - will not have a connection, may read e-mail
  • Jorge: out Dec 21-Jan 6
  • Eliana: out Dec 28-Jan 14
  • SeangChan: a couple of days around Christmas/New Year's, exact days not yet determined
  • Matteo: will be around; in Italy Dec 21-Jan 7 but working remotely with good internet, so availability won't change, just on EU time
  • Alan: going to Brazil Dec 21-Jan 21, will be working from Brazil Jan 14-20 - SeangChan has Alan's grandma's number
  • Dima - Will be around

News - Dima

  • Holiday Production next week! Digi/ReReco next week
    • 5 billion events in the system already
    • High priority, multicore/multithreading - we are running it this way for T0
    • What do we need to do differently on the assignment side? Changes in the config: we need to specify the number of cores required
      • assignment is fine
      • When will they come? Not sure yet
      • only real overhead: job splitting doesn't change
      • We can adjust all the parameters without breaking anything
      • We would be "using extra pilots": reserved but not actually used
      • If we are running out of memory, bump it up
      • If we are running out of time, not clear what to do
      • assignments will use the standard assignment scripts
    • we need requirements on the assignment side
  • sort out issues with existing stuff, lots of ACDCs on the ReReco side (2015C)
  • ReReco for everything in the 76 release will go out today!
    • this will be large; what can we use? It's multicore, so can we use T2s or only T1s? Not entirely sure which T2s are ready for multicore
      • which sites are truly ready for multicore? You can get it from the Factory - Ajit and Gaston
    • we need to redistribute RAWs - Unified should handle it; it needs to make sure data is distributed only to sites that can handle multicore
  • also want to inject HI - make sure to blacklist Fermilab for HI WFs

Top 3 issues affecting production

  • 218 sucked up a huge workflow and has been unstable ever since, it was put into drain on Friday
  • We got caught over the weekend with only 3 agents running at full power
    • Once we got more agents up, things began to recover
    • Why so few? We were in the middle of doing required system upgrades and then had an agent become unstable
    • Brian B did a lot of digging and has been doing a fair amount of bug fixes
      • where are we at on patches? SeangChan have you reviewed all the changes...
  • About documentation:
    • Julian, turn in all your code. For the next week, don't change anything except to fix things that don't work! We won't know what doesn't work or what documentation is missing if we don't go through it all!
  • how long should it take for high priority to overtake slots? 1-2 days

Site support - Gaston

  • Into the Waiting Room: T2_IN_TIFR
  • Out of the Waiting Room: T2_RU_INR

Sites in Waiting Room: T2_TH_CUNSTDA, T2_RU_SINP, T2_IN_TIFR.

Sites in Morgue: T2_PL_Warsaw, T2_MY_UPM_BIRUNI, T2_TR_METU, T2_RU_RRC_KI, T2_RU_ITEP, T2_RU_PNPI

Transfers - Jorge

  • Jorge is working on stuck FNAL transfers - where are we? FNAL still red?
  • TMDB has been updated with the correct size of the files, ~60K files, so now the transfers are working.

Workflows

  • Multicore recovery
    • There was talk on a thread about running the Christmas production multicore, and what we would have to do to recover missing data (ACDC, recovery script, etc.). The thread ended with "Let's talk this out in person at CERN this week," and I haven't heard a thing about it since. Anybody know what the solution was? This is exactly what I am worried about: without a WF team leader/person at CERN to sit in and find out what is going on, we are going to be way behind the curve here.

ReDigi

TaskChains

StepChain

Rereco

Store Results

  • check documentation

MonteCarlo

Agent Issues

  • 218 still unstable (will be redeployed before it comes back to life)
  • 219 can be taken out of drain when the load gets low

Agent redeployment

  • Next production stable release aimed at Feb/2016
  • Ready to be redeployed: submit2, vocms0311
  • cmssrv218 and 219 are in drain (Workqueues overloaded).

production SL6

FNAL                                   CERN
cmsgwms-submit1 (up)                   vocms0308 (up)
cmsgwms-submit2 (ready to redeploy)    vocms0309 (up)
cmssrv217 (up)                         vocms0310 (up)
cmssrv218 (drain - overloaded)         vocms0311 (ready to redeploy)
cmssrv219 (drain - overloaded)         vocms0304 (on HLT tests)
                                       vocms0303 (up / highprio)

RelVal - Andrew

  • Haven't heard from him in ages... is he OK? He's connected and OK! Has nothing to report!

L3 discussion - Ajit, Jean-Roch, Matteo

Opportunistic Resources

Automatic Assignment And Unified Software

  • We need documentation! Matteo is working on it and will continue to look at it.

AOB

  • New WMAgent version; need to redeploy HP and RelVal agents

Topic revision: r7 - 2015-12-17 - JenniferAdelmanMcCarthy
 