%TOC{title="Workflow Team Meeting July 15, 2013"}

INDICO Link: https://indico.cern.ch/conferenceDisplay.py?confId=

Attending

John, Jen, Edgar, Andrew, Xavier, Yuyi

Personnel

Jul 9 --> Jul 16 Sunil
Jul 16 --> Jul 23 Xavier
  • Jen will be taking some vacation time in mid-August (exact dates TBD) and will be taking at least half a day on July 26

Issues last week

  • Over the weekend, couch replication went down on 216, causing MC production to come to a halt
  • Something weird is going on with workflows on 98 & 112; it could be related to the couch replication
  • Diego was able to fix it Monday morning. How hard is the fix? Do we need Diego/Seanchan to do this, or is this something operations should do? We can't be losing an entire weekend of processing on a regular basis.
  • Problems with updates to the closeout script: a fix was made but not turned in. It is in now and should work.
    • Jen will update and make sure it is running properly
  • The closeout script was marking WFs as duplicates that didn't have duplicate issues

  • Jacob and Jen spent Friday debugging and verifying that it is OK to close out ReDigi WFs that "only failed duplication"
    • Edgar will check whether the input dataset is broken; if it is, he will make sure we are closing it out.
  • This issue is also affecting Step0. Diego fixed the logic for Step0, but ReDigi is still broken.

Site Issues

  • Job submission on HammerCloud was always failing at the crab -submit step because the SSH bridge that crab uses to connect to itself was broken.
    • It creates a socket file in /tmp, and the file was not removed after a connection closed, so new connections couldn't start.
    • Should be fixed now. Let's keep an eye on it.
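
The stale socket file problem described above can be checked for with a small script. The following is only a sketch, assuming the bridge uses an ordinary Unix domain stream socket; the function names and the example path are hypothetical and are not taken from crab. A socket file is treated as stale when it exists but nothing accepts connections on it.

```python
import os
import socket
import stat


def is_stale_socket(path, timeout=1.0):
    """Return True if path is a Unix socket file that nothing is listening on."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False              # nothing there, so nothing to clean up
    if not stat.S_ISSOCK(mode):
        return False              # never touch regular files or directories
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    probe.settimeout(timeout)
    try:
        probe.connect(path)
    except (ConnectionRefusedError, FileNotFoundError):
        return True               # file exists but no listener behind it
    except OSError:
        return False              # unclear state: be conservative, keep the file
    else:
        return False              # a live listener answered
    finally:
        probe.close()


def remove_if_stale(path):
    """Delete the socket file only when it is stale; return True if removed."""
    if is_stale_socket(path):
        os.unlink(path)
        return True
    return False
```

For example, `remove_if_stale("/tmp/example-bridge.sock")` (a made-up path) could run before a new connection is opened, so a leftover file from a closed connection does not block it.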

  • T2_EE_Estonia frequently shows CRITICAL SAM-SRM (org.cms.SRM-VOGet) for some hours, then OK again for a short period of time, then again CRITICAL.
    • The problematic tests are org.cms.SRM-VOPut and org.cms.SRM-VOGet.
    • No action taken, since the policy did not say to do so.
    • jobSuccess is at 97% (MC production)
    • We will keep it up for now until we see things fail again.

  • Jobs failing at T2_FI_HIP due to CMSSW
    • installation of the new release was under way
    • problem solved (Jul 11)

  • Review procedure to add an opportunistic site
  • SSB - how to add an opportunistic site
    • i.e. a site that doesn't have a CE and doesn't publish to BDII
    • like a cloud or like CERN_HLT. Ask Andrea Sciaba.
  • SAM tests fail but the site is still OK for production
    • We need the subset of tests that we really need to look at in order for production to happen.
    • This has already been tested when we moved to HammerCloud.
      • Andrew knows the twiki with the full list of tests and will send it around.
      • Edgar will look at the list and mark which ones production is worried about
  • Sites should be updating their entries in SiteDB, and the same information needs to be updated in the pledge view, which Edgar maintains.
    • Edgar will try to figure out how it is done again and then tell John how to do it; it will become the site support team's job.

Agents

  • 201 is taking a long time to drain
  • Recently there has been a big problem with CMS log file size; there are 30-40 blocks in DBS2 that are not in DBS3, dated from June 30
    • There are a few issues with the WMAgent.
    • How do we find the problem? How are we detecting it, and how do we decide who to contact about it?
    • The Agent otherwise appears to be working fine. The problem is that the blocks are not being inserted into DBS3, but the Agent is not flagging this and letting somebody know to look at it.
    • open a github issue
    • We need a flag
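
One way to picture the flag discussed above is a periodic comparison of the two block lists. This is only a sketch: it assumes the block names and creation times have already been fetched by some other means, and it does not call any DBS2, DBS3, or WMAgent API. The function name, the one-day threshold, and the input shapes are all hypothetical.

```python
from datetime import datetime, timedelta


def find_missing_blocks(dbs2_blocks, dbs3_blocks, now, max_age=timedelta(days=1)):
    """Return (name, created) pairs for blocks in DBS2 but absent from DBS3.

    dbs2_blocks: dict mapping block name -> creation datetime (assumed input)
    dbs3_blocks: iterable of block names already present in DBS3 (assumed input)
    Blocks younger than max_age are skipped, since they may still be in flight.
    """
    present = set(dbs3_blocks)
    missing = [
        (name, created)
        for name, created in dbs2_blocks.items()
        if name not in present and now - created > max_age
    ]
    # Oldest first, so the longest-standing problems are at the top.
    return sorted(missing, key=lambda item: item[1])
```

A non-empty result is the condition that would raise the flag, for instance by opening a ticket or sending a message to whoever is on shift.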

Workflows

  • Continuing work with Resubmission script
  • Need to address Step0/1 issues
  • Upgrade Workflows - why did they take so long? We need a post-mortem

IEEE Paper

Draft Outline #1

  • Introduction (Why we need to run so many simulations, why we need to do a re-reconstruction of the data) (Edgar/Jen)
  • a brief discussion of what the different types of workflows are, and how they are processed differently (Diego/Jen/Edgar)
  • monitoring for T1 & T2 sites (Diego/Jen/Edgar)
  • How we ran prior to 2011
    • ProdAgent vs WMAgent (Diego/Alan) (Focus on differences and improvements)
    • Reprocessing and Production (Jen/Xavier) (How this was handled with ProdAgent and why there was a need to move to another framework)
  • How we ran with WMAgent (after 2011)
    • WMAgent/ReqMgr/WorkQueue (Diego/Edgar/Alan) General comment on how it works
    • PREP/ReqMgr Interaction (Vincenzo?)
    • Organization of the workflow team and operations around it (Edgar)
  • Achievements
    • Events reconstructed (L3s)
    • Usage of the grid (Edgar/Jen/L3s)
  • Conclusions / Outlook (Edgar/Jen)

Action Items

  • Recovery workflows - Jen - ongoing
    • Diego got us an updated recovery workflow script!
    • Discovered that the recovery workflows were creating some duplicates
    • We have 1 more resubmission to go with our test case; then we'll comb through the outputs carefully to make sure everything is OK, and then start the mass re-processing.
  • We need to add a daily report on workflow stats
    • How many workflows are running, pending, waiting, stuck
      • Jen - come up with a template report
      • Edgar - please comment on workflow statuses. I feel like we are not always communicating which workflows are in a waiting status for various issues.
      • Diego - how hard would it be to have a "manual switch" that we can set on workflows for "waiting"? If there is a group of workflows for which we are waiting to hear back from a site/requesters before closing out, we could put those workflows in "waiting", so that the ones in "complete" really are ready to be closed or need to be looked at.
  • Diego - Can we have the script you wrote for finding stuck workflows?
    • Diego will put it in a public place so we can add it to svn
    • Is it documented yet?
      • need to pull documentation out of e-log and put it on the twiki - Jen
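
As a starting point for the daily workflow stats report mentioned above, a tally along these lines might do. It is a sketch only: the input is assumed to be a list of (workflow name, status) pairs obtained elsewhere, the status names are the four listed in the action item, and the function names and report layout are hypothetical.

```python
from collections import Counter

# Statuses named in the action item; anything else is reported as "other".
REPORT_STATUSES = ("running", "pending", "waiting", "stuck")


def count_statuses(workflows):
    """Count workflows per status from (name, status) pairs."""
    counts = Counter(status for _name, status in workflows)
    summary = {status: counts.get(status, 0) for status in REPORT_STATUSES}
    summary["other"] = sum(
        n for status, n in counts.items() if status not in REPORT_STATUSES
    )
    return summary


def format_report(summary, title="Workflow status report"):
    """Render the summary as plain text, one status per line."""
    lines = [title]
    for status, n in summary.items():
        lines.append(f"  {status}: {n}")
    return "\n".join(lines)
```

Calling `format_report(count_statuses(workflows))` once a day would give a text block that can be pasted into an e-log entry or an email.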

AOB

Topic revision: r4 - 2013-07-17 - EdgarFajardo
 