Indico link: https://indico.cern.ch/conferenceDisplay.py?confId=254665

Attending

Jen, John, Edgar, Andrew, Diego, Alan

  • Jul 23 --> Jul 30 Sunil
  • Jul 30 --> Aug 6 Sara
  • Jen may be taking Thursday off
  • Jen will be taking some vacation time in mid-August, somewhere within Aug 10-28; the exact travel dates are still to be figured out.
  • Diego off 7-9, gone after Wed 16th

Issues last week

  • Problems with missing blocks in the Upgrade Workflows
    • Edgar - how pervasive is this? How are your recovery tests going? I found more yesterday.
      • Edgar thinks he is finished; the workflows were cloned with a block whitelist
      • Any more we find will be sent Edgar's way
  • Test WFs for commissioning
    • John, you had some questions on the updated twikis; is everything working now?
      • He was able to assign the test but could not see the WFs in WMStats
      • He needed to look in cmsweb-testbed, where he found them
    • The documentation is inconsistent, and the scripts for this differ between git and svn; which is right? Can we get rid of the inaccurate pieces?
      • John will update the twikis (teams and sites) as documented for production
  • Agents
    • cmssrv113: the agent was using an old certificate; Alan fixed it by creating a symlink to the new certificate.
    • vocms235: timeout problem; getting information from the jobdump couch view takes a long time. Seangchan fixed it by increasing the timeout threshold to 5 min.
  • WFs
    • spinoso_TOP-Summer12_FS53-00019_R2906_B21_01_LHE_130725_104236_4840: had an RSS memory problem; rerun with lower splitting.
    • pdmvserv_SUS-Summer12pLHE-00001_2_v0_STEP0ATCERN_130724_010909_8610: config problem, WF aborted

Site Issues

KNU does not need to be commissioned; it is working just fine. Criteria: site readiness < 60% (last 3 months and last 1 week).

| Site in MC | Slots | Status | Notes | Issues |
| T2_RU_IHEP | 700 | - | re-commissioned | Ok |
| T2_AT_Vienna | 212 | skip | under commissioning | Ok |
| T2_GR_Ioannina | 94 | skip | under commissioning | good T2 links to/from T1 |
| T2_KR_KNU | 300 | skip | does not need recommissioning it is working just fine | SAM mc, jobsubmit |
| T2_UA_KIPT | 300 | skip | confirm slots with site | SAM swinst, jobsubmit |

Agents

  • vocms216 is being drained.
  • vocms237 - do we need to put in the patch? Yes; Diego will put it on the deployment page.

Workflows

  • Waiting to hear back on the recovery workflows before mass resubmission
    • Jen will find the e-mail and explain why this is not so straightforward

What information do we need to drain out of Diego's brain in the next 2 wks?

IEEE Paper

Draft Outline #1

  • Introduction (why we need to run so many simulations, why we need to re-reconstruct the data) (Edgar/Jen)
  • A brief discussion of the different types of workflows and how they are processed differently (Diego/Jen/Edgar)
  • Monitoring for T1 & T2 sites (Diego/Jen/Edgar)
  • How we ran prior to 2011
    • ProdAgent vs WMAgent (Diego/Alan) (focus on differences and improvements)
    • Reprocessing and Production (Jen/Xavier) (how this was handled with ProdAgent and why we needed to move to another framework)
  • How we ran with WMAgent (after 2011)
    • WMAgent/ReqMgr/WorkQueue (Diego/Edgar/Alan): general comment on how it works
    • PREP/ReqMgr interaction (Vincenzo?)
    • Organization of the workflow team and operations around it (Edgar)
  • Achievements
    • Events reconstructed (L3s)
    • Usage of the grid (Edgar/Jen/L3s)
  • Conclusions / Outlook (Edgar/Jen)

Action Items

  • Talk to Jacob, Andrew and Ajit to produce a global picture of the upgrade workflows - Jen
    • There isn't a script; take a workflow that is failing and compute the events per lumi and the time per event in the failing lumisections
  • Write twiki on disk/tape separation at T1_IT_CNAF - Edgar - ongoing
  • SVN - IEEE paper - Edgar. Ongoing.
  • Recovery workflows - Jen - ongoing
    • The first 2 workflows are completely through; we are now waiting for people to look at them carefully and make sure there are no show-stoppers before we do the other 50.
  • We need to add a daily report on workflow stats
    • A new state for workflows that are completed and whose ACDC has already been dealt with.
    • How many workflows are running, pending, waiting, or stuck
    • Is it documented yet?
      • Need to pull the documentation out of the e-log and put it on the twiki - Jen
    • Do we understand how to find the stuck workflows and fix them without Diego's help?
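The per-lumisection check from the upgrade-workflow action item (events per lumi and time per event in the failing lumisections) could be done along the lines below. This is a minimal sketch, not an existing tool: it assumes the failing job's output has already been reduced to (lumi, events, wall-clock seconds) records, and the `LumiRecord`, `lumi_metrics` and `summarize` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LumiRecord:
    # Hypothetical summary of one lumisection from a failing job:
    # lumisection number, events processed, wall-clock seconds spent on it.
    lumi: int
    events: int
    wall_seconds: float


def lumi_metrics(records: List[LumiRecord]) -> List[Tuple[int, int, float]]:
    """Return (lumi, events in that lumi, seconds per event) for each record."""
    rows = []
    for r in records:
        # Empty lumisections get NaN instead of a division by zero.
        per_event = r.wall_seconds / r.events if r.events else float("nan")
        rows.append((r.lumi, r.events, per_event))
    return rows


def summarize(records: List[LumiRecord]) -> Dict[str, float]:
    """Mean events per lumi, and overall seconds per event (total time / total events)."""
    if not records:
        raise ValueError("no lumisection records given")
    total_events = sum(r.events for r in records)
    total_seconds = sum(r.wall_seconds for r in records)
    return {
        "events_per_lumi": total_events / len(records),
        "seconds_per_event": total_seconds / total_events if total_events else float("nan"),
    }


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    failing = [LumiRecord(12, 100, 5000.0), LumiRecord(13, 250, 15000.0)]
    for lumi, events, per_event in lumi_metrics(failing):
        print(f"lumi {lumi}: {events} events, {per_event:.1f} s/event")
    print(summarize(failing))
```

Keeping the per-lumi rows separate from the summary makes it easy to spot whether a failure comes from a few very dense lumisections or from a uniformly slow time per event.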

AOB

  • What are the biggest problems we are facing right now as the workflow team?
    • Hand-holding with ACDC/cloning/resubmission
      • Subscription of data tiers is manual - Edgar will work on this
    • We are getting better at discovering why workflows are stuck, but we do not have the tools to fix the problems ourselves and depend on the developers to fix them, which is an issue over the weekend.
    • Stuck workflows with a few blocks: one of the developers generally has to go in and fix them manually; if this happens on the weekend we are stuck until Monday.
    • CVMS issue
Topic revision: r5 - 2013-07-30 - JenniferAdelmanMcCarthy