Workflow Team Meeting - May 15

Vidyo Link


  • Julian - CERN
  • Andrew
  • Dave, SeangChan and Jen
  • Sara and Jasper


May 8 -> May 15: Jasper
May 15 -> May 22: Sara


  • Do we need to change the closeout script requirements for PhEDEx? - Dave, this will be your discussion to lead
    • Now that Disk and Tape are separated, we will not always subscribe the data to tape. Our current requirement is that we have an approved subscription.
    • Over the next week FNAL is updating the dCache for tape, so tape subscriptions will not be approved. Either we drop that requirement, manually close out the WFs, or let them sit for a week.
    • We can change the requirement so that the subscription only has to have been made, not approved.
    • We currently check, in this order:
      • % done -> PhEDEx -> Health of dataset
      • change to % done -> Health of Dataset -> PhEDEx
      • For the next week we will use the new order and close out FNAL WFs manually.
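
The proposed check order (% done, then dataset health, then PhEDEx) can be sketched roughly as follows. This is only an illustration, not the actual closeout script: the `WorkflowState` fields, the `closeout_verdict` function, and the 95% threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkflowState:
    """Hypothetical summary of what the closeout checks look at."""
    percent_done: float           # percentage of requested events produced
    dataset_healthy: bool         # result of the dataset health check
    subscription_made: bool       # a PhEDEx subscription request exists
    subscription_approved: bool   # the subscription has been approved


def closeout_verdict(wf, min_percent=95.0, require_approval=False):
    """Apply the checks in the proposed order and report the first failure.

    min_percent is an assumed threshold, not a value from the meeting.
    require_approval=True mimics the old rule (approved subscription);
    False mimics the relaxed rule (subscription only has to be made).
    """
    if wf.percent_done < min_percent:
        return "hold: percent done"
    if not wf.dataset_healthy:
        return "hold: dataset health"
    if not wf.subscription_made:
        return "hold: no PhEDEx subscription"
    if require_approval and not wf.subscription_approved:
        return "hold: subscription not approved"
    return "close out"
```

With `require_approval=False`, a workflow whose tape subscription has been requested but not yet approved would pass, which is the situation expected at FNAL during the dCache update.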

  • Status of syncing up and recovery after the WMStats problems last week - SeangChan this will be your part of the discussion to lead
    • Add any missing permanent documents by hand:
      • The definitive list of missing documents has yet to be generated.
      • We copy them by hand if they exist in the current database; otherwise we regenerate them from the ReqMgr DB (with loss of some document information).
    • Sara will check the running WFs to see if their status matches.
  • UPG not "highest priority" anymore, but still High Priority.
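
The copy-or-regenerate rule for the missing WMStats documents could look roughly like the sketch below. Plain dictionaries stand in for the databases, and the function name, its arguments, and the `regenerated` flag are hypothetical; the real recovery was done by hand against CouchDB and the ReqMgr DB.

```python
def recover_documents(missing_ids, current_db, reqmgr_db, target_db):
    """Restore missing documents into target_db.

    Each *_db argument is a plain dict keyed by document id (a stand-in
    for the real databases). Returns three lists of ids: copied,
    regenerated, and unrecoverable.
    """
    copied, regenerated, unrecoverable = [], [], []
    for doc_id in missing_ids:
        if doc_id in current_db:
            # Preferred path: the full document still exists, copy it as-is.
            target_db[doc_id] = dict(current_db[doc_id])
            copied.append(doc_id)
        elif doc_id in reqmgr_db:
            # Fallback: rebuild from the request record only, so some
            # document information is lost. Flag it so it can be found later.
            target_db[doc_id] = {"request": dict(reqmgr_db[doc_id]),
                                 "regenerated": True}
            regenerated.append(doc_id)
        else:
            unrecoverable.append(doc_id)
    return copied, regenerated, unrecoverable
```

Keeping the three lists separate makes it easy to see afterwards which documents carry reduced information and which still need attention.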

Jasper's notes

Agent Issues

  • Jobs couldn't be aborted:
    • Alan & SeangChan applied a patch, and now it's working (Elog discussion)
  • General network problem at CERN:
    • Affected GlideIn Collector (generalized job drop on Friday).
  • cmssrv98 and cmssrv112:
    • Host certificate expired, and couch replication issues (Elog discussion)
  • Step0 agent's hack:
    • Happens when a request asks for more events than the input LHE file has.
    • This produces a 'Products Unmergeable Error'.
    • If the hack is applied, empty files are ignored and events are merged.
    • The hack is not compatible with the most recent WMStats version.
    • This is in fact an error in CMSSW, so it needs to be fixed there, not hacked around.
    • Dave will follow up with offline and get this fixed.

Workflow issues

  • Workflow status mismatch 14488
    • I only checked the Redigi - Julian, did you check the MC for this?
    • Julian didn't find anything weird - usual hiccups and missing job information
  • Redigi -
    • EXO-Summer12DR53X - WFs that failed 100% due to input pileup datasets being deleted - 65 of them!
    • Yes, I know I'm way behind going through the rest. I'm tackling the big issues first, then will drill down into the ones that need to be looked at one by one.
    • Redigis in failed state
      • Let's discuss the question here: what is the origin of this failed status? We don't see it in the GENSIM/FSIM.
      • Why do we keep ending up with "good workflows" stuck in failed, while WFs like the EXO-Summer12DR53X ones that failed out miserably move through to complete?
  • Julian's question: What is this "StoreResult" thing that we are supposed to take over?
    • Luis is finishing up the twiki to explain everything and will turn it over to us.

Site issues - John, please fill this in!

Andrew's questions

  • Last week's meeting notes stated that Andrew would be back today! Welcome back. So what questions do you have for us? ;)


  • Nada
-- JenniferAdelmanMcCarthy - 15 May 2014

Topic revision: r4 - 2014-05-15 - JenniferAdelmanMcCarthy