https://indico.cern.ch/conferenceDisplay.py?confId=254679

Attending

Luis, Seangchan, Jen, Dave (FNAL), Jullian (CERN), Xaviar, Andrew

Personnel

Oct 29 --> Nov 4 Adli

  • I see nothing for this week. Xaviar, who is on shift? Sunil
  • As Jen will be dealing with recovery efforts this week, I'm hoping Dorian can jump in and do some extra agent/workflow monitoring during US time.
  • Xaviar will put Adli in one more training shift and then put out the EU shift schedule.
  • Luis will be gone Dec 11-13; Dec 16-20 he will be working remotely.
  • Andrew? - Dec 23-Jan 3 - will be in Chicago
  • Everyone, please send in your schedules so we can plan shifts.
  • Jen will be working remotely Fri/Mon, and will take the day off if things are slow.

Agent Issues

  • v0.9.82 deployed in vocms85, vocms202, vocms216, vocms227, vocms234, vocms235, cmssrv98, cmssrv112, cmssrv113. Next: vocms201 (in drain)
    • vocms201 still has workflows running:
      pdmvserv_SUS-Summer12FS53-00099_00023_v0__131026_120436_3712 94.95%
      pdmvserv_SUS-Summer12FS53-00082_00023_v0__131026_120658_9360 94.2%
    • And some more completed.
  • Can I please request that when you update an agent you put the date on the https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsWorkflowTeamWmAgentRealeases twiki? Seangchan and I were trying to figure out which patches went in on which days, to better isolate the patch that is causing our current missing lumi issues. I was scanning e-logs for announcements, but it would be nice to put the date on the twiki as well, as long as you are updating it.
  • Agent issues during the weekend: job counts dropped off on vocms216 and vocms235.
    • Too few jobs pending: there are 84 assigned-approved workflows, but none of them is assigned. The local work queue in both agents is not pulling jobs from the global work queue because there is no work for them.
    • Too many idle jobs: we thought it was related to Condor. They started to move on Monday.

Workflow Issues

  • The first of the backfill WFs made it through OK. There were some problems getting them started, but once they started we actually had fewer problems with them because the parent bug has been fixed.
    • Seangchan will pull the data out of Couch for latency studies and give it to Dave. That will give us a better idea of what we can do to optimize the system.
  • Summary of the Redigi WFs in complete as of Sunday night FNAL time:
    • We had various agent and file read issues with these workflows.
    • We have been giving the developers time to look at the workflows that are at 100% but only have log collect jobs failing, so they can try to figure out what is happening in the agent.
    • For the workflows they have already looked at, I have begun the recovery procedure, and we are getting to 100%.
    • I am still unsure why we have so many workflows complaining about bad/corrupt files. When I tried to ACDC them they still gave the same errors, but running recovery is getting me to 100%.
    • Workflows stuck in acquired: some are due to site downtimes. Slowly moving.
franzoni_Fall53_2011A_ElectronHad_Run2011A-v1_Prio2_5312p1_130916_235208_6171 no ACDC failures but still 99.7% - https://cmslogbook.cern.ch/elog/Workflow+processing/10923
franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576 ACDC done, still getting performance kills - https://cmslogbook.cern.ch/elog/Workflow+processing/10915
franzoni_Fall53_2011A_METBTag_Run2011A-v1_Prio1_5312p1_130916_235506_5872 running recovery scripts - https://cmslogbook.cern.ch/elog/Workflow+processing/10924
franzoni_Fall53_2011A_MuEG_Run2011A-v1_Prio1_5312p1_130916_235024_8520 recovery got us to 100%
franzoni_Fall53_2011A_SingleElectron_Run2011A-v1_Prio1_5312p1_130916_235126_4483 ACDC ran with no errors, but still not at 100% - https://cmslogbook.cern.ch/elog/Workflow+processing/10914
franzoni_Fall53_2011B_HT_Run2011B-v1_Prio1_5312p1_130916_235352_6611 ACDC skim failed with file read errors, running recovery - https://cmslogbook.cern.ch/elog/Workflow+processing/10927
franzoni_Fall53_2011B_Jet_Run2011B-v1_Prio1_5312p1_130916_235443_2635 only errors are log collect, but only at 99% - https://cmslogbook.cern.ch/elog/Workflow+processing/10922
franzoni_Fall53_2011B_SingleElectron_Run2011B-v1_Prio1_5312p1_130916_234926_3581 still need developers to look - https://cmslogbook.cern.ch/elog/Workflow+processing/10900
linacre_Fall53_2011A_BTag_Run2011A-v1_Prio1_5312p1_131014_173207_9252 starting recovery, missing Lustre file - https://cmslogbook.cern.ch/elog/Workflow+processing/10928
linacre_Fall53_2011A_DoubleElectron_Run2011A-v1_Prio1_5312p1_131014_173217_6318 ACDC failure, Lustre missing - https://cmslogbook.cern.ch/elog/Workflow+processing/10929
linacre_Fall53_2011A_Photon_Run2011A-v1_Prio1_5312p1_131014_173238_3886 performance timeouts ACDC - https://cmslogbook.cern.ch/elog/Workflow+processing/10904
linacre_Fall53_2011A_SingleMu_Run2011A-v1_Prio1_5312p1_131014_173248_9770 waiting on developers - https://cmslogbook.cern.ch/elog/Workflow+processing/10735
linacre_Fall53_2011B_DoubleElectron_Run2011B-v1_Prio1_5312p1_131014_173259_942 running ACDC
linacre_Fall53_2011B_DoubleMu_Run2011B-v1_Prio1_5312p1_131014_173309_8207 running recovery - https://cmslogbook.cern.ch/elog/Workflow+processing/10912
linacre_Fall53_2011B_Photon_Run2011B-v1_Prio1_5312p1_131014_173330_3061 only log collect failed and still not 100% - https://cmslogbook.cern.ch/elog/Workflow+processing/10913
linacre_Fall53_2011B_SingleMu_Run2011B-v1_Prio1_5312p1_131014_173340_3758 only log collect failed and still not 100% - https://cmslogbook.cern.ch/elog/Workflow+processing/10900

AOB

  • vocms174 should always have the latest WMA deployment installed, rather than having a spare VM with the environment to run scripts. https://cmslogbook.cern.ch/elog/Workflow+processing/10934
  • DBS standalone tests: Seangchan will give instructions on how to set things up, and we need to run the data.
    • we need to check PhEDEx and DBS3
    • Testbed agent: cmssrv94, or ask Prestlov which one we are testing for testbed. Julian will set this up, working with Jullian and Prestlov on how to do it.
  • If WMStats shows a component is down, it may just be a thread that is down. Please e-log the problem that WMStats is complaining about and restart the component.

Site Readiness & Tickets

     total   new   net  unmodified  category
        32    6    +5      10       Facilities
        16    5    +3       8       SAM tests
        19   10    +5       6       Data Transfers
         5    1    +1       4       Data Operations
         5    1    +1       3       CMS HammerCloud
         3    0    +0       2       Register new site
         3    0    -2       1       CAF Operations
         2    2    +1       0       Analysis Operations
         1    1    +1       0       Consistency Checks
         1    1    +1       0       Central Workflows
         0    1    +0       0       0                             

Waiting Room

last updated on: 11:21:42 04 Nov 2013 by GokhanKandemir

-- JenniferAdelmanMcCarthy - 05 Nov 2013

Topic revision: r5 - 2016-07-22 - StephanLammel
 