Workflow Team Meeting - May 28, 4 PM CERN time, 9 AM FNAL time

Vidyo Link


  • FNAL: Jen, Jorge, Luis, SeangChan, John
  • US: Ajit, Dima, Ian, JeanRoc
  • CERN: Julian, Andrew, Alan
  • EU:


  • Jen unavailable from noon Friday through Monday morning
  • This will be Ian's last week; he is switching to ATLAS
  • Jen needs to start bugging people about more US operators

News - Dima

  • Need to invalidate ~200 requests. This should be done on the PPD side; they will reset and resubmit.
    • The list was sent last night and the invalidation is already done. The new work has not been resubmitted yet.
    • We have no way of moving requests from announced to rejected in ReqMgr.
    • The state can be changed manually if really needed, but as written, announced-archived is a final resting place.
    • We do not have a timeline for the replacements.
  • We may have some urgent requests for the first paper, but probably nothing significant.
    • We will need some MC on first data; it will be small.
  • WFs being assigned to T2_CH_CERN can have jobs sent to AI and HLT; that is OK for HLT but not for AI.
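The request state flow discussed above can be sketched as a tiny state machine. State names are taken from the discussion; the transition map is an illustrative simplification, not the actual ReqMgr implementation:

```python
# Illustrative (assumed, simplified) sketch of ReqMgr request-state rules
# as discussed above: announced-archived is a terminal state, so a request
# cannot be moved from announced back to rejected through the normal flow;
# invalidation has to happen by reset-and-resubmit on the PPD side.
ALLOWED_TRANSITIONS = {
    "assigned": {"acquired"},
    "acquired": {"running"},
    "running": {"completed"},
    "completed": {"announced"},
    "announced": {"announced-archived"},  # final resting place
    "announced-archived": set(),          # terminal: no path back to "rejected"
}

def can_transition(current, target):
    """Return True if this sketched state machine allows current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

Under this sketch, `can_transition("announced", "rejected")` is False, which is exactly the limitation noted above.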

3 top issues affecting production

  • Ongoing issues with workflows being stuck in acquired for long periods of time.
    • Alan and SeangChan solved it! 70K jobs running (100K in the whole pool).
    • Last week we had two backend nodes failing, and they were removed.
    • Initially there were errors making PhEDEx calls; that error seems to be gone, but WFs were still not moving because all acquired WFs were assigned to T2_CH_CERN, and a bug in the multiprocessing code was blocking these jobs from going in.
    • We need to revisit the threshold settings; SeangChan, Luis and Alan will work together on this.
  • Global WorkQueue not updating block location (it thinks everything is only at T0_CH_CERN).
    • Symptoms: workflows with 0 errors but < 100% of lumis (in all datasets); stuff stuck in acquired and running-closed.
    • Alan is debugging it.
  • Workflows stuck close to completion and needing to be force-completed. This is happening more often than it used to; are we being impatient, or is this related to the agents-getting-stuck problem? Is force-completing things causing issues?
    • Julian: in my opinion there is no easy answer to that; we already discussed this.
    • Change the behavior of force-complete so it doesn't wait for log-collect and cleanup; then we can move things along, and once those are done the workflow will move to archived.
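The proposed force-complete change can be sketched as follows. This is a hypothetical simplification; the task and state names are illustrative and not the actual WMAgent code:

```python
# Hypothetical sketch of the proposed force-complete behavior: instead of
# waiting for LogCollect and Cleanup to finish, force the production tasks
# complete immediately and let the auxiliary tasks drain in the background;
# the workflow moves to archived once they are done.
AUXILIARY_TASKS = {"LogCollect", "Cleanup"}

def force_complete(tasks):
    """Return the workflow status after a force-complete.

    `tasks` maps task name -> done flag. Production tasks are forced to
    done; auxiliary tasks are left to finish on their own.
    """
    for name in tasks:
        if name not in AUXILIARY_TASKS:
            tasks[name] = True  # force production tasks complete
    pending_aux = [n for n, done in tasks.items()
                   if n in AUXILIARY_TASKS and not done]
    # the workflow can move along now; it archives once auxiliaries drain
    return "completed" if pending_aux else "archived"
```

The point of the sketch is that the force-complete call itself no longer blocks on the auxiliary tasks, addressing the "stuck close to completion" complaint above.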

Site support - John



  • RunIISpring15DR74: 495 assigned, 27 acquired, 336 running, 16 completed, 58 announced (1130 at the beginning), so around 40% through.



Store Results

  • A new StoreResults request popped into the system this week: T3_US_Cornell


Agent Issues

Redeployment Plan

RelVal Andrew

  • Did you look into the PhEDEx instance? Not yet; he will try to look next week.
  • Is there a GitHub issue for why we still have jobs running when a workflow is aborted? Not sure there is one yet; maybe we need to look.

L3 discussion - Ajit, Jean-Roch, Matteo

Opportunistic Resources - Stefan



Automatic Assignment And Unified Software


Last Agent status

  • Next plan: balance the FNAL and CERN agents (drain vocms0311, wake one of the cmssrv machines)

| production | SL6 |
| cmsgwms-submit1 (up) | vocms0308 (up) |
| cmsgwms-submit2 (up) | vocms0309 (up) |
| cmssrv217 (ready to wake) | vocms0310 (up) |
| cmssrv218 (ready to wake) | vocms0311 (drain) |
| cmssrv219 (ready to wake) | |

-- JenniferAdelmanMcCarthy - 2015-05-27

Topic revision: r7 - 2015-05-28 - JenniferAdelmanMcCarthy