Workflow Team Meeting - Jun 5

Vidyo Link

Attending

  • Luis, SeangChan, Jen - FNAL Dave on phone, John
  • Ajit
  • Julian - CERN

Personnel

May 29 -> June 5: Adli
June 5 -> June 12: Sara

News

  • Reminder to all Operators that we need to keep on top of "stuck" workflows. Please check daily for WFs that have no jobs running but are sitting in running-closed, as well as Vincenzo's list.
    • Workqueue on 234 is plugged up; we need to get things moving on it again
    • 85 is in a similar state, complaining of duplicate directories
  • What happened to our daily e-mails? Are we keeping on top of restarting components and checking that all agents are full?
  • Extended workflows - SeangChan has a patch; should it be pushed? Julian committed it. Alan and Diego can deploy it.
    • deploy in testbed and test it
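
The daily "stuck" workflow check described above can be sketched as a simple filter. This is a minimal illustration only: the record layout and the field names (`status`, `running_jobs`) are assumptions made for the example, not the actual WMStats schema, and the workflow names are made up.

```python
# Hypothetical workflow records; field names are illustrative assumptions,
# not the real monitoring schema.
workflows = [
    {"name": "wf_a", "status": "running-closed", "running_jobs": 0},
    {"name": "wf_b", "status": "running-closed", "running_jobs": 120},
    {"name": "wf_c", "status": "running-open", "running_jobs": 0},
]


def find_stuck(workflows):
    """Return names of workflows sitting in running-closed with no jobs running."""
    return [
        wf["name"]
        for wf in workflows
        if wf["status"] == "running-closed" and wf["running_jobs"] == 0
    ]


stuck = find_stuck(workflows)
print(stuck)
```

In practice the records would come from whatever monitoring source the operators already use; only the filter condition is taken from the notes.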

Site issues

  • To remind everyone, the clouds are accessible from the production pool; you just have to include them in your site whitelist
    • Use T2_CH_CERN, T2_CH_CERN_HLT, T2_CH_CERN_T0, T2_CH_CERN_AI to use all CERN resources; T2_CH_CERN alone is not enough
      • AI is only for testing, so leave it out of the list
    • Management is watching...
    • HLT is about 4k cores and AI/T0 is about 5k cores.
  • Can we assign tape requests to FNAL again?
  • T0_CH_CERN has been added to our list and production status is on; we just need to update thresholds
  • Checking sites in drain:
    • Nothing new. The new ones in the Morgue are marked as down; we will keep them down and not use them until further notice.
    • 2 sites are OK now: Taiwan and METU. John will keep an eye on them and let us know
    • There is a new metric/ggus in the site readiness view that lists all tickets open for the site.

Adli's Notes

Agent Issues

  • This then brings up another question: do we have enough WMAgents to feed all resources? Here is a little overview:
    • T1 level: 25k slots
    • T2+T3 level: 30k slots
    • CERN: T2_CH_CERN (4k) + T2_CH_CERN_HLT (4k) + T2_CH_CERN_T0/AI (5k) = 13k
      • total ~70k slots

  • we have agents for:
    • 2 active MC agents
      • vocms216: 20k running job limit
      • vocms235: 20k running job limit
    • 1 MC agent in drain
      • vocms201: 20k running job limit
    • 2 active high priority digi-reco agents
      • cmssrv112: 12k running job limit
      • cmssrv98: 12k running job limit
    • 3 active redigi agents:
      • vocms234: 20k running job limit
      • vocms202: 20k running job limit - in drain
      • vocms85: 14k running job limit
      • which means we have enough capacity for redigi but not enough for MC. This could be a plan, to be discussed in the workflow team meeting tomorrow:
  • Oli's recommendations
    • leave vocms85 in the redigi team
    • retask vocms227 to be a MC agent - this is a virtual machine, so can't be retasked
    • check if vocms237 still needs to be used as step0 agent or if it can be (temporarily) used for MC production - probably too small.. but it could be used
    • Julian will talk to CERN IT to try to get another box or two so that we can run at full speed.
    • Julian is concerned that it is a glidein pileup, not an agent issue.
  • Note: vocms234 hasn't had any jobs running in the last few days! How can we check if everything is ok?
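
The slot and agent figures in the overview above can be tallied with a short script. This is only a sketch of the arithmetic: the numbers are the ones quoted in these notes, the data layout and function name are made up for illustration, and agents marked "in drain" are excluded from the active totals.

```python
# Slot counts quoted in the notes (job slots).
slots = {
    "T1": 25_000,
    "T2+T3": 30_000,
    "CERN": 4_000 + 4_000 + 5_000,  # T2_CH_CERN + HLT + T0/AI
}

# Running-job limits per agent as listed in the notes: (team, limit, in_drain).
agents = {
    "vocms216": ("MC", 20_000, False),
    "vocms235": ("MC", 20_000, False),
    "vocms201": ("MC", 20_000, True),
    "cmssrv112": ("digi-reco", 12_000, False),
    "cmssrv98": ("digi-reco", 12_000, False),
    "vocms234": ("redigi", 20_000, False),
    "vocms202": ("redigi", 20_000, True),
    "vocms85": ("redigi", 14_000, False),
}


def active_limit_by_team(agents):
    """Sum the running-job limits of agents that are not in drain, per team."""
    totals = {}
    for team, limit, in_drain in agents.values():
        if not in_drain:
            totals[team] = totals.get(team, 0) + limit
    return totals


total_slots = sum(slots.values())
limits = active_limit_by_team(agents)
print(total_slots)  # 68000, i.e. the "~70k" quoted above
print(limits)       # {'MC': 40000, 'digi-reco': 24000, 'redigi': 34000}
```

Whether a given team has enough capacity depends on how the slots are split between MC, digi-reco and redigi work, which the notes do not specify; the script only adds up the quoted limits.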

Workflow issues

  • Merge jobs are not going in fast enough and are pending. Julian is looking into it.
    • merge jobs should run before anything else.. period...
  • SeangChan's list of Stuck WFs 14901
    • Julian already looked at the MonteCarlo
    • Jen will look at the Redigi
  • WmStats shows that a lot of jobs are queued but not yet pending. Why don't we have more jobs pending?
    • Julian will give SeangChan a list of WF's to look at.
      • Sometimes the location isn't yet resolved, so they are not yet submitted
        • are SE names mapping properly?
      • Resources not available, right now we have resources available though

Store Results

  • handoff from Luis to Julian done on Tues, Julian will run first 3 wks.. then Jen will give it a try
    • Julian received a couple tickets and will try to get things running tomorrow

MonteCarlo

  • A few things about request priorities, and questions about how many jobs are queued inside the agent. How long does it take for a job to go from queued to pending/running?

Redigi/Rereco

RelVal

  • no Andrew this week

AOB

  • SeangChan can not access/restart RelVal agent 142, and 113 and 174 are no longer used. Workqueue manager needs to be restarted on 174. It's Alan's agent and he needs to take care of it.
-- JenniferAdelmanMcCarthy - 05 Jun 2014
Topic revision: r3 - 2014-06-05 - JenniferAdelmanMcCarthy