Workflow Team Meeting - March 26th, 4 PM CERN time, 10 AM FNAL time!

Vidyo Link

Attending

  • FNAL: Jen, Luis, Seangchan, John
  • US: Ajit
  • CERN: Julian, Andrew, Alan, Dima, JeanRoc

Personnel

  • Luis in Colombia around May 1-10 (dates tentative), giving a talk on the 6th - on vacation, refuses to give his mom's home phone number
  • Julian out over Easter Holidays March 28-April 4 - offline
  • Jen may be taking time off around Easter, not sure of the dates yet
  • Seangchan off a few days during March 28-April 4, possibly Mon, Wed, Fri (to be confirmed)
  • Matteo is at CERN this week, Moriond next week

News

  • New EU Operators!
    • James Keaveney - Belgium
    • Haneol Lee - SNU Korea
    • And now Alex Van Spilbeeck (also from Belgium)

3 top issues affecting production

  1. The missing files mystery, updates (GGUS ticket 111932):
    • We have another clue: jobs could be deleting a file when it is written directly to the /merged area.
    • Doing some tests.
    • Failures in Steps 2 & 3 require cloning.
  2. Some ReDigis need to be retried because their output was lost due to a misconfiguration (see elog).
    • pdmvserv_SMP-Summer12DR53X-00019_00374_v0__150301_035649_1287 (closed out)
    • pdmvserv_SMP-Summer12DR53X-00018_00374_v0__150301_000335_5218 (completed)
    • pdmvserv_TOP-Summer12DR53X-00291_00374_v0__150301_000336_8892 (normal-archived)
    • pdmvserv_HIG-2019GEMUpg14DR-00093_00080_v0__150302_170159_9419 (normal-archived)
    • Can clone the first two. Can we invalidate and un-announce the second two?
      • Julian can provide the list of missing blocks. Dima will go to Physics and see whether what we have is good enough or whether we should clone first. If they need to be cloned, Jen will do so tomorrow.
  3. IN2P3 was failing SAM tests, so we were not submitting data there.

Site support

Opportunistic sites

  • Not reporting

Workflows

  • Found a problem with the recovery script: when it is run on workflows that have whitelists, it ignores the whitelist when it makes the recovery workflows.
    • This still needs attention. SeangChan wants to handle it himself, along with the closeout script, and port it to reqmgr2. We need to keep a config that ops can change at any time, but pull the rest into Request Manager 2.
    • Alan will talk to Julian tomorrow and look at it
  • Some WFs have been waiting to be assigned for several days. Reminder to check subscriptions first. These are all MC, so they shouldn't depend on inputs.
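
A minimal sketch of the behaviour we would want from the recovery script for the whitelist problem noted above: the recovery workflow inherits the whitelist of the original workflow unless ops explicitly overrides it. This is not the actual recovery script. The function name, the dict layout and the field names ("RequestName", "OriginalRequestName", "SiteWhitelist") are assumptions made for illustration only.

```python
from copy import deepcopy


def build_recovery_request(original, overrides=None):
    """Build a recovery request dict from an original workflow dict.

    Hypothetical helper: the field names below are assumed for this
    sketch and are not taken from the real recovery script.
    """
    # Start from whatever ops explicitly wants to change for the recovery.
    recovery = dict(overrides or {})
    recovery["OriginalRequestName"] = original["RequestName"]

    # Carry the whitelist over instead of silently dropping it.
    # An explicit override from ops still wins.
    if "SiteWhitelist" in original and "SiteWhitelist" not in recovery:
        recovery["SiteWhitelist"] = deepcopy(original["SiteWhitelist"])

    return recovery


if __name__ == "__main__":
    # Placeholder workflow and site names, not real ones.
    wf = {"RequestName": "example_workflow", "SiteWhitelist": ["T1_EXAMPLE_SITE"]}
    print(build_recovery_request(wf))
```

Copying the list (rather than sharing it) keeps later edits to the recovery request from changing the original workflow's whitelist.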

ReDigi

  • Jobs are failing when reading outputs for Steps 2 & 3. ACDCs are failing as well, so we need to clone. This is happening at KIT and CNAF.

TaskChains

  • Only 1 TC - announced today

MiniAODs

  • nothing to report

Rereco

  • Parent Bug issue: https://cms-logbook.cern.ch/elog/Workflow+processing/19366
  • Jobs were failing because of an incorrect configuration (DBS bug); agents will be patched to fix the issue. We are going to have to reject and clone workflows. If we don't need the others, let's kill them.
  • Dima, can you check whether we need all of these and let us know in the elog? Yes.

Store Results

MonteCarlo

  • Two TSG-RunIIWinter15GS workflows with filter efficiency issues (see elog)
    • Manually closed out, please announce them. The requestor might want some feedback.
    • I think there are a couple more affected.
    • Ajit is sending back to requestors
  • BoogaBooga WFs are being killed, which is killing components; just keep restarting them

Agent Issues

Redeployment Plan

production SL6
FNAL                              CERN
cmsgwms-submit1 (up)              vocms0308 (up)
cmsgwms-submit2 (ready to wake)   vocms0309 (ready to wake)
cmssrv217 (up)                    vocms0310 (ready to wake)
cmssrv218 (up)
cmssrv219 (up)

  • Alan will work on the old agent and update it to the new couches

RelVal (Andrew)

  • We have a big problem with workflows closing out separately when 2 WFs are using the same input dataset. SeangChan is working on it, but there are lots of changes to be made. How far along is this work? Before we announce, LogArchive and cleanup need to be done. For RelVal they really need the LogArchive, so we wait until things move to completed.

  • Is there a plan to inject the logfiles into EOS?

  • What is the status of the batches in reqmgr2? For the disk deployment, there is a flag for requestmanager only; it needs to be tested in a private VM first.

AOB

  • Ajit still does not have access to the scripts repository
-- JenniferAdelmanMcCarthy - 2015-03-18
Topic revision: r2 - 2015-03-26 - JenniferAdelmanMcCarthy
 