Indico link: https://indico.cern.ch/conferenceDisplay.py?confId=254662

Attending

  • Jen, John, Diego, Edgar, Sara

Personnel

  • Coming off shift - Sara
  • Going on shift - Sunil
  • Jen will be taking some vacation time in mid-August; exact dates TBD
  • Edgar will be in Paris from Friday to Monday and will not be monitoring at all over that period (Friday, the weekend, and Monday); Jen will have to do the Monday operations meeting

Issues last week

  • nothing unusual
  • new ticket with Finnish site
  • Is the IN2P3 problem fixed? No. There is a GGUS ticket but it hasn't been addressed yet. Next Monday is a holiday in France, so it needs to be taken care of ASAP or we won't have it until next Tuesday at the earliest
  • 235 is running in late-binding mode, meaning a job can run at any of the sites in its list; when you analyse a pending job, it checks which of the glideins are available
  • we need to come up with a better way of determining which sites have glidein problems when workflows have a long list of sites they can run at. For now: look at the dashboard for pending jobs to see which sites have a lot of jobs pending and not enough running, and dig in on those sites
  • 216 - watch: Analytics Collector and couch seem to have "fixed themselves"
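
The "lots of jobs pending and not enough running" check described above could be sketched roughly as follows. This is only an illustration, not an existing tool: the function name, the thresholds, and the input format (a mapping of site name to pending and running job counts, copied by hand or by script from the dashboard) are all assumptions, and the site names and numbers in the example are made up.

```python
from typing import Dict, List, Tuple


def flag_starved_sites(
    job_counts: Dict[str, Tuple[int, int]],
    min_pending: int = 100,          # assumed threshold: ignore sites with few pending jobs
    max_running_ratio: float = 0.1,  # assumed threshold: running/pending at or below this is suspicious
) -> List[Tuple[str, int, int]]:
    """Return (site, pending, running) for sites with many pending and few running jobs."""
    flagged = []
    for site, (pending, running) in job_counts.items():
        if pending < min_pending:
            continue  # not enough pending work to draw a conclusion
        if running / pending <= max_running_ratio:
            flagged.append((site, pending, running))
    # Sites with the largest backlog first, so we dig into those first.
    flagged.sort(key=lambda row: row[1], reverse=True)
    return flagged


if __name__ == "__main__":
    # Hypothetical counts: site -> (pending, running)
    example = {
        "T2_XX_SiteA": (500, 10),
        "T2_XX_SiteB": (300, 250),
        "T2_XX_SiteC": (20, 0),
        "T2_XX_SiteD": (1000, 50),
    }
    for site, pending, running in flag_starved_sites(example):
        print(f"{site}: {pending} pending, {running} running")
```

The thresholds would need tuning against real numbers; a site that is simply small will always have a low running count, so the ratio is used rather than the absolute number of running jobs.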

Site Issues

  • T2_US_Vanderbilt - not giving enough slots over weekend
  • ASGC back as opportunistic processing

  • T2_FI_HIP & T2_IT_Bari
    • moved out of drain - keeping an eye on these sites during this week

  • T2_GR_Ioannina & T2_AT_Vienna - need to be commissioned
    • Edgar doesn't have a lot of time to help with the commissioning
    • Do we have instructions written down so that John can do this? John has a basic list and will talk to Edgar when stuck
    • let's get this on a twiki linked off the Workflow Team Main twiki

| Site in MC | WN | Status | Notes |
| T0_CH_CERN | 2000 | down | leave down |
| T1_RU_JINR | 0 | skip | waiting to be commissioned |
| T1_TW_ASGC | 1500 | drain | working in opportunistic mode |
| T1_UK_RAL_Disk | 2000 | down | leave down, it exists only for PhEDEx |
| T2_BR_UERJ | 200 | drain | network problems |
| T2_GR_Ioannina | 0 | skip | needs to be commissioned |
| T2_IN_TIFR | 200 | drain | keep in drain as long as possible - everything is an issue at this site |
| T2_RU_IHEP | 1000 | drain | in waiting room |
| T2_RU_INR | 100 | drain | network problem |
| T2_RU_JINR | 1500 | drain | network problem |
| T2_RU_RRC_KI | 0 | drain | network problem |
| T2_RU_SINP | 100 | drain | network problem |
| T2_TR_METU | 200 | drain | in waiting room |
| T2_AT_Vienna | 400 | skip | need to start commissioning - John |
| T2_FR_GRIF_IRFU | 0 | skip | shares WNs with LLR - as long as we are using GRIF_LLR we don't need IRFU |
| T2_KR_KNU | 300 | skip | needs re-commissioning |
| T2_RU_PNPI | 10 | skip | in waiting room |
| T2_UA_KIPT | 500 | skip | too small to make it work |

Agents

  • Agents upgraded in previous week
  • Agents down/drain for upgrade
    • 201 - in drain for upgrade
    • 112 - is still in drain
  • Agents with other issues
    • do we understand the dip in jobs on 235 over the weekend? It seemed to "fix itself", which always makes me nervous
  • files not injected into DBS
  • with late binding, when jobs fail fast we don't get good site reporting to the dashboard or WMStats, which makes these issues hard to track
    • Diego is looking into it and has an idea for a fix. The fix may not be 100% reliable: when jobs fail too fast we don't get the FWJR back, so we need the condor history anyway

Workflows

  • cleanup unmerged jobs failing at FNAL - Xavier and Edgar looking at them after the meeting.

IEEE Paper

Draft Outline #1

  • Introduction (why we need to run so many simulations, why we need to do a re-reconstruction of the data) (Edgar/Jen)
  • A brief discussion of what the different types of workflows are, and how they are processed differently (Diego/Jen/Edgar)
  • Monitoring for T1 & T2 sites (Diego/Jen/Edgar)
  • How we ran prior to 2011
    • ProdAgent vs WMAgent (Diego/Alan) (focus on differences and improvements)
    • Reprocessing and Production (Jen/Xavier) (how this was handled with ProdAgent and why we needed to move to another framework)
  • How we ran with WMAgent (after 2011)
    • WMAgent/ReqMgr/WorkQueue (Diego/Edgar/Alan) General comment on how it works
    • PREP/ReqMgr interaction (Vincenzo?)
    • Organization of the workflow team and operations around it (Edgar)
  • Achievements
    • Events reconstructed (L3s)
    • Usage of the grid (Edgar/Jen/L3s)
  • Conclusions / Outlook (Edgar/Jen)

Action Items

  • Recovery workflows - Jen - ongoing
    • Diego got us an updated recovery workflow script!
    • discovered that the recovery workflows were creating some duplicates
  • we need to add a daily report on Workflow stats
    • How many workflows are running, pending, waiting, stuck
      • Jen - come up with a template report
      • Edgar - please comment on workflow statuses; I feel like we are not always communicating which workflows are in a waiting status for various issues
      • Diego - how hard would it be to have a "manual switch" that we can set on workflows for "waiting"? If there is a group of workflows where we are waiting to hear back from a site or the requesters before closing out, we could put those workflows in waiting, so that things that are in "complete" really are ready to be closed or need to be looked at
  • Diego - Can we have the script you wrote for finding stuck workflows?
    • Diego will put it in a public place so we can add it to svn
    • Is it documented yet?
      • need to pull documentation out of the e-log and put it on the twiki - Jen
  • Problems with dbsTest.py - done???
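
As a starting point for the daily workflow status report discussed above, a template could look roughly like the sketch below. Everything here is an assumption for illustration: the function name, the input format (a list of workflow name and status pairs, however we end up collecting them), and the report layout. It does not query ReqMgr or WMStats; it only counts and formats.

```python
from collections import Counter
from typing import Iterable, Tuple

# The four statuses named in the action item; anything else is lumped into "other".
REPORT_STATUSES = ("running", "pending", "waiting", "stuck")


def daily_workflow_report(workflows: Iterable[Tuple[str, str]], report_date: str) -> str:
    """Build a plain-text summary of how many workflows are in each status."""
    counts = Counter(status for _name, status in workflows)
    lines = [f"Workflow status report for {report_date}"]
    for status in REPORT_STATUSES:
        lines.append(f"  {status:<8} {counts.get(status, 0)}")
    other = sum(n for status, n in counts.items() if status not in REPORT_STATUSES)
    if other:
        lines.append(f"  {'other':<8} {other}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical workflow names and statuses, for illustration only.
    example = [
        ("wf_a", "running"),
        ("wf_b", "running"),
        ("wf_c", "waiting"),
        ("wf_d", "complete"),
    ]
    print(daily_workflow_report(example, "YYYY-MM-DD"))
```

If a manual "waiting" switch is added later, it would only change how the status in each pair is determined; the counting and formatting would stay the same.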

AOB

Topic revision: r7 - 2013-07-16 - JenniferAdelmanMcCarthy
 