Luis, Seangchan, Jen, Dave (FNAL), Jullian (CERN), Xaviar, Andrew


Oct 29 --> Nov 4 Adli

  • I see nothing for this week. Xaviar, who is on shift? Sunil
  • As Jen will be dealing with recovery efforts this week, I'm hoping Dorian can jump in and do some extra agent/workflow monitoring during US hours.
  • Xaviar will put Adli in one more training shift and then put out the EU shift schedule.
  • Luis will be gone Dec 11-13; Dec 16-20 Luis will be working remotely.
  • Andrew? - Dec 23-Jan 3 - will be in Chicago
  • Everyone, please send in your schedules so we can plan shifts.
  • Jen will be working remotely Fri/Mon and, if things are slow, will take the day off.

Agent Issues

  • v0.9.82 deployed in vocms85, vocms202, vocms216, vocms227, vocms234, vocms235, cmssrv98, cmssrv112, cmssrv113. Next: vocms201 (in drain)
    • vocms201 still has workflows running:
      pdmvserv_SUS-Summer12FS53-00099_00023_v0__131026_120436_3712 94.95%
      pdmvserv_SUS-Summer12FS53-00082_00023_v0__131026_120658_9360 94.2%
    • And some more completed.
  • Can I please request that when you update an agent you put the date on the twiki? Seangchan and I were trying to figure out which patches went in on which days, to better isolate the patch that is causing our current missing-lumi issues. I was scanning e-logs for announcements, but it would be nice to have the date on the twiki as well, since you are updating it anyway.
  • Agent issues during the weekend: vocms216 and vocms235 dropped off of jobs
    • Too few jobs pending: 84 assignment-approved workflows, but none of them is assigned. The local work queue in both agents is not pulling jobs from the global work queue because there is no work for them.
    • Too many idle jobs: we thought it was related to condor; it started to move on Monday.

Workflow issues:

  • The first of the backfill WFs made it through OK. There were some problems getting them started, but once they started we actually had fewer problems with them because the parent bug has been fixed.
    • Seangchan will pull the data out of Couch for latency studies and give it to Dave; that will give us a better idea of what we can do to optimize the system.
  • Summary of the Redigi WF's in complete as of Sunday night FNAL time:
    • we had various agent and file read issues with these workflows.
    • We have been giving the developers time to look at the workflows that are at 100% but only have log collect jobs failing, so they can try to figure out what is happening in the agent.
    • For the workflows that have already been looked at, I have begun the recovery procedure, and we are getting to 100%.
    • I am still unsure why we have so many workflows complaining about bad/corrupt files. When I tried to ACDC them they still gave the same errors, but running recovery is getting me to 100%.
    • Workflows stuck in acquired: some are due to site downtimes. Slowly moving.
franzoni_Fall53_2011A_ElectronHad_Run2011A-v1_Prio2_5312p1_130916_235208_6171 - no ACDC failures but still at 99.7%
franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576 - ACDC done, still getting performance kills
franzoni_Fall53_2011A_METBTag_Run2011A-v1_Prio1_5312p1_130916_235506_5872 - running recovery scripts
franzoni_Fall53_2011A_MuEG_Run2011A-v1_Prio1_5312p1_130916_235024_8520 - recovery got us to 100%
franzoni_Fall53_2011A_SingleElectron_Run2011A-v1_Prio1_5312p1_130916_235126_4483 - ACDC ran with no errors, but still not at 100%
franzoni_Fall53_2011B_HT_Run2011B-v1_Prio1_5312p1_130916_235352_6611 - ACDC skim failed with file read errors; running recovery
franzoni_Fall53_2011B_Jet_Run2011B-v1_Prio1_5312p1_130916_235443_2635 - only errors are log collect, but only at 99%
franzoni_Fall53_2011B_SingleElectron_Run2011B-v1_Prio1_5312p1_130916_234926_3581 - still need developers to look
linacre_Fall53_2011A_BTag_Run2011A-v1_Prio1_5312p1_131014_173207_9252 - starting recovery; missing Lustre file
linacre_Fall53_2011A_DoubleElectron_Run2011A-v1_Prio1_5312p1_131014_173217_6318 - ACDC failure, Lustre file missing
linacre_Fall53_2011A_Photon_Run2011A-v1_Prio1_5312p1_131014_173238_3886 - performance timeouts in ACDC
linacre_Fall53_2011A_SingleMu_Run2011A-v1_Prio1_5312p1_131014_173248_9770 - waiting for developers
linacre_Fall53_2011B_DoubleElectron_Run2011B-v1_Prio1_5312p1_131014_173259_942 - running ACDC
linacre_Fall53_2011B_DoubleMu_Run2011B-v1_Prio1_5312p1_131014_173309_8207 - running recovery
linacre_Fall53_2011B_Photon_Run2011B-v1_Prio1_5312p1_131014_173330_3061 - only log collect failed and still not at 100%
linacre_Fall53_2011B_SingleMu_Run2011B-v1_Prio1_5312p1_131014_173340_3758 - only log collect failed and still not at 100%


  • vocms174 should always have the latest WMA deployment installed, rather than having a spare VM with the environment to run scripts.
  • DBS standalone tests: Seangchan will give instructions on how to set things up, and we need to run the data.
    • we need to check PhEDEx and DBS3
    • Testbed agent: cmssrv94, or ask Prestlov which one we are testing for testbed. Julian will set this up and work with Jullian and Prestlov on how to do it.
  • If WMStats shows a component is down, it may just be a thread that is down. Please e-log the problem that WMStats is complaining about and restart the component.

Site Readiness & Tickets

     total   new   net  unmodified  category
        32     6    +5      10      Facilities
        16     5    +3       8      SAM tests
        19    10    +5       6      Data Transfers
         5     1    +1       4      Data Operations
         5     1    +1       3      CMS HammerCloud
         3     0    +0       2      Register new site
         3     0    -2       1      CAF Operations
         2     2    +1       0      Analysis Operations
         1     1    +1       0      Consistency Checks
         1     1    +1       0      Central Workflows

Waiting Room

last updated on: 11:21:42 04 Nov 2013 by GokhanKandemir

-- JenniferAdelmanMcCarthy - 05 Nov 2013

Topic revision: r5 - 2016-07-22 - StephanLammel