Ajit, Julian - CERN; Andrew, Jen, Dave M, SeangChan, Luis, Yuyi - FNAL

Dec 3 --> Dec 10 Sara
Dec 10 --> Dec 17 Xavier

  • Julian's holidays: Dec 27th to 29th and Jan 7th to 11th
  • Adli may be on holiday Jan 1-6
  • Jen: around the week of Christmas, but probably off Dec 29th to Jan 6
  • Jen: off Dec 18th
  • Dec 23-Jan 1: Xavier and Sunil will be on shift but working from home
  • CERN closed Dec 22-Jan 6
  • Luis will be gone Dec 11-13; Dec 16-20 he will be working remotely

Review of last week's issues

  • As of Monday night FNAL time there are no stuck WFs! That said, most of the week was spent debugging, really trying to understand what is happening with our stuck workflows, and actually FIXING things instead of just shoving them along.
  • Problems discovered:
    • Agent issues that affected:
      • Dataset marked as DELETED by vocms201: /Higgs0PHf05ph0ToZZTo4L_M-125p6_8TeV-powheg15-JHUgenV3/Summer12-START53_V7C-v1/GEN-SIM (inconsistencies between DBS and DBS3).
      • Merge jobs failed while a site was draining; we don't know which site. Example: pdmvserv_SUS-Summer12FS53-00109_00031_v0__131129_122529_6419
      • TaskArchiver: a DB query takes too long to complete (several hours); it is currently under study. This prevents workflows from moving from "running" to "complete".
      • PhedexInjector: a bug that prevents blocks or datasets from being correctly subscribed in PhEDEx. It is usually fixed by running a script. This prevents workflows from moving from "complete" to "close-out".
      • General stability of the agent: under heavy load, components crash very frequently (every six to ten hours a component needs to be restarted). I'm sending a plot that Adli made counting crashes per component, per agent. As you can see, the MC agents crash more frequently.
      • GlideIn issues: glitches or downtimes in the GlideIn Factory or Frontend prevent jobs from being assigned (this doesn't happen very often).
      • Priority queue: as we were just discussing, some workflows currently get stuck because of unfinished LogCollect or cleanup jobs. This may need a deeper change in WMAgent.
      • Thresholds are not set properly: Vincenzo says that if a site has fewer than 10 slots for running jobs, we should not consider it for MC.
    • Site issues:
      • Blackhole nodes: when a node at a site has problems but keeps receiving jobs, it can quickly fail thousands of jobs. We usually deal with this by draining the site and recovering any jobs that failed. Last week we opened at least 4 tickets to sites. Sometimes some production jobs finished successfully but could not be merged because the site was drained.
      • xrootd: last week we had lots of jobs failing due to xrootd unavailability.
    • Request issues: these cause requests to fail, but are harder to debug:
      • filter efficiency issues
      • dataset / input data inconsistency
      • wrong job splitting
      • wrong site assignment
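
The threshold rule Vincenzo proposes could be sketched as a simple filter. This is a hypothetical Python snippet for illustration only — the site names, slot counts, and the `mc_eligible_sites` helper are all invented, not WMAgent code:

```python
# Hypothetical sketch of the proposed rule: a site with fewer than 10
# slots for running jobs should not be considered for MC production.
# Site names and slot counts below are invented for illustration.

MIN_MC_SLOTS = 10

def mc_eligible_sites(running_slots_by_site):
    """Keep only sites whose running-job slot count meets the MC minimum."""
    return [site for site, slots in running_slots_by_site.items()
            if slots >= MIN_MC_SLOTS]

slots = {"T2_XX_Example": 250, "T2_YY_Tiny": 4, "T1_ZZ_Example": 1200}
print(mc_eligible_sites(slots))  # the 4-slot site is filtered out
```

In the real agent the per-site slot counts would come from resource control, so this check would belong wherever thresholds are applied during work assignment.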
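
The blackhole-node pattern above (one bad node quickly failing thousands of jobs) suggests a simple tally of failures per worker node. This is a hypothetical sketch, not the monitoring we actually run; the record format and threshold are invented:

```python
from collections import Counter

# Hypothetical sketch: flag candidate "blackhole" worker nodes by
# counting failed jobs per node. A node with an outsized failure count
# is a candidate for draining the site and recovering the failed jobs.
# The default threshold is illustrative.

def suspect_blackholes(failed_jobs, min_failures=50):
    """failed_jobs: iterable of (job_id, worker_node) pairs for failed jobs."""
    failures_per_node = Counter(node for _job, node in failed_jobs)
    return sorted(node for node, count in failures_per_node.items()
                  if count >= min_failures)
```

Even a crude tally like this would let shifters spot a bad node before thousands of jobs are lost, rather than after.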

  • Challenges:
      • The major challenge right now is the lack of a single, straightforward way to decide which problem is affecting a particular workflow.
      • Sometimes a workflow appears to have job errors due to site issues, but when it is retried some job information is lost (agent issues), and it then turns out that it also had an original request issue.
      • Unknown causes: sometimes a workflow (or a workflow batch) is affected by several simultaneous causes, making it harder to debug.
      • Currently I have at least 5 workflows that I suspect have some misconfiguration; however, they also had site issues, so they are being retried.
      • How do we deal with this?
    • Agent issues:
      • Our developers are currently studying the known bugs; however, some are harder to spot (e.g. the lost job information).
      • When a component crashes we usually restart it, but sometimes a deeper error is behind the crash.
      • We usually need to force-complete workflows when the agent fails to detect that they are ready.
        • We need to make sure that force-completing isn't making our problems worse. There is some suspicion that this is what is leading to our TaskArchiver problems.
    • Site issues:
      • We put the site in drain when we see failures; however, by the time the failures are spotted, several jobs may already have failed.
      • We usually deal with this by ACDC or a full kill-and-clone (this week I did 16 ACDCs and 48 clones myself, some of them re-clones or re-ACDCs).
      • This means manually keeping track of ACDCs and clones, and monitoring dataset progress.
      • Problems with log file access, especially when there is very little time between jobs running and the log files being accessed.

Agent Issues

  • See stuck WF issues above
  • If possible, discuss a better mechanism to replace what we do now
  • ErrorHandler crashing: a big failed job could not be processed due to stack memory overload. Temporarily solved, but the database needs to be cleaned.
  • Upgrade to Condor 8: vocms216, 202, and 85 start to drain
  • Component_down_statistic_1.png
    • let's look at this a bit more and decide if we have ideas of how to make this a useful plot
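
One way to make this plot more useful (and reproducible) would be to derive the counts directly from component logs. A hypothetical sketch follows — the log-line format here is invented, not the real WMAgent log format:

```python
import re
from collections import defaultdict

# Hypothetical sketch of how a crash-count plot like
# Component_down_statistic_1.png could be produced: scan log lines for
# crash events and tally them per (agent, component). The line format
# matched below is illustrative only.

CRASH_RE = re.compile(r"agent=(\S+)\s+component=(\S+)\s+event=crash")

def crash_counts(log_lines):
    """Tally crash events per (agent, component) pair."""
    counts = defaultdict(int)
    for line in log_lines:
        match = CRASH_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return dict(counts)
```

With counts keyed by (agent, component), the plot could show per-component bars grouped by agent, which would make the "MC agents crash more" claim easy to verify week to week.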

Site Issues

     total   new   net   unmodified
         8     3    +0            4   Central Workflows

Workflow Issues

  • franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576 - still running recovery, the rest of the ReReco WF's are done

Andrew's Questions


-- JenniferAdelmanMcCarthy - 09 Dec 2013