Workflow Team Meeting - May 22

Vidyo Link

Attending

  • FNAL: Jen and Dave, Luis, John, and SeangChan
  • CERN: Julian, Andrew, Alan
  • Sara & Andrew

Personnel

May 15 -> May 22 Sara
May 22 -> May 29 Xavier

News

  • FNAL subscriptions are still on hold; we are manually closing out WFs as discussed at last week's meeting.
    • Dave plans on rejecting what we have and sending everything to T0_CH_CERN_MSS instead.
    • For the time being we should continue to manually close out FNAL WFs when they meet all the criteria other than the tape subscription.
    • Julian still needs to change the script.
  • Store Results: I know Luis has a few things he wants to clean up before turning it over. Where are we on the turnover? When can we meet to discuss?

Sara's Notes

  • I tried a few times to lcg-cp something this week and kept failing. Is there a new procedure? Accessing logs from a PFN used to not be so tricky.
  • What was the problem with the ErrorHandlers Monday/Tuesday?
    • Couch connection errors to central couch. The ErrorHandler can retry and wait for the next cycle; if it fails more than once, it should shut itself down (see the sketch after this list).
    • A GitHub issue has already been opened for it.
  • Does anyone else have issues making CMS tickets? I couldn't submit one this morning.
  • I have to leave the meeting at 16:30
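
A minimal sketch, in Python, of the retry behaviour discussed above for the ErrorHandler's central-couch connection: tolerate one failure, wait for the next polling cycle, and shut the component down if it fails again. The names here (CentralCouchUnreachable, poll_with_retry, fetch_errors) are hypothetical illustrations, not WMCore code.

    # Hypothetical sketch, not WMCore code: retry once on a central-couch
    # connection error, wait for the next cycle, shut down on a repeat failure.
    import time


    class CentralCouchUnreachable(Exception):
        """Raised when the central couch instance cannot be contacted."""


    def poll_with_retry(fetch_errors, cycle_seconds=300, max_failures=2):
        """Polling loop that shuts down after max_failures consecutive failures."""
        consecutive_failures = 0
        while True:
            try:
                fetch_errors()                # e.g. query central couch for failed jobs
                consecutive_failures = 0      # success resets the counter
            except CentralCouchUnreachable:
                consecutive_failures += 1
                if consecutive_failures >= max_failures:
                    print("Central couch unreachable twice in a row; shutting down.")
                    break                     # leave the restart to the operator/watchdog
                print("Central couch unreachable; retrying on the next cycle.")
            time.sleep(cycle_seconds)         # wait for the next polling cycle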

Agent Issues

  • 235 and 216 keep filling up their disks due to the high number of jobs on the machines. 14644
    • Shutting down JobCreator and JobSubmitter to allow things to "settle" and let couch shrink back down seems to be working until we can get the agents fixed.
    • We had two ACDCs that asked for a trillion events; the WF had a really bad filter efficiency.
    • We will change the alarm to 60% and then drain and rotate the agents (see the sketch after this list).
  • 201 and 85 are draining for upgrade
    • I need validation from the developers to avoid deleting information.
    • Can I switch vocms85 from 'reproc_lowprio' to 'mc'?
    • Andrew says OK...
  • Extension error: https://github.com/dmwm/WMCore/issues/5148
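
A minimal sketch, assuming a plain disk-usage check, of the 60% alarm threshold mentioned in the agent-issues list; the watched path and the alert action are placeholders, not the actual agent alarm.

    # Placeholder sketch, not the production alarm: flag the agent host for
    # drain-and-rotate once the watched partition passes 60% usage.
    import shutil

    ALARM_THRESHOLD = 0.60   # alarm when the partition is 60% full
    WATCHED_PATH = "/"       # replace with the agent's data/couch partition


    def disk_usage_fraction(path):
        """Return the fraction of the partition at `path` that is in use."""
        usage = shutil.disk_usage(path)
        return usage.used / usage.total


    if __name__ == "__main__":
        fraction = disk_usage_fraction(WATCHED_PATH)
        if fraction >= ALARM_THRESHOLD:
            # In production this would raise an alert so the agent can be
            # drained and rotated; here we only print a message.
            print(f"ALARM: {WATCHED_PATH} is {fraction:.0%} full (threshold 60%)")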

Workflow issues

Site issues

Andrew's questions

Notes for Friday's Meeting

Problems that have affected the system in the last two months

  • GlideIn Collector overload - Too many pending jobs (March 03) Elog discussion:
    • Slowed down job matching, causing a lower number of running jobs.
    • Caused by WMAgent pulling in too many jobs at a time -> bottleneck.
  • Another GlideIn - Collector overload (April 03) Elog discussion
  • WMAgent memory and CPU overload (April 24-29) Elog discussion:
    • A component (the ErrorHandler) tried to allocate too much RAM.
    • Caused a spike in CPU.
    • Agent completely crashed every ~20 hours.
  • GlideIn Front end proxy not renewed (Apr 29) Elog discussion
    • Jobs were not assigned (General failure).
    • A certificate had expired
  • Central couch rotation problem (May 2-12) Elog discussion
    • Affected WMStats information -> we weren't able to see workflow information and/or resubmit workflows.
    • When combined with the WMAgent redeployment -> some workflow information was lost or got corrupted.
      • Old workflows stuck in "running" (Not enough info to complete them and no progress).
    • Problem workflows couldn't be spotted earlier.
    • System was crippled for a whole week.

Some other external events:

  • Highest prio workflows injected
    • Pushed aside everything else.
    • Everyone was looking only at these highest-priority workflows.

Actions taken

  • Limit the number of pending jobs in the system
    • Controlling site thresholds
    • WMAgent will also have a soft limit (see the sketch after this list)
    • Avoid GlideIn collector and WMAgent overload.
  • WMAgent ErrorHandler config -> avoid RAM overload.
  • Central couch rotation procedure -> Needs a small downtime to be safest.
  • Revise the WMAgent redeployment procedure -> avoid losing information.
  • Revise tools and procedures for workflow monitoring
    • Catch workflows that are having problems in an early stage
    • Catch stuck workflows and identify causes
    • Distinguish workflows affected by request, site, or infrastructure issues.
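
A minimal sketch of the pending-job soft limit from the list above: submit only up to each site's threshold, and stop entirely once a global soft limit on pending jobs is reached. All names and numbers are illustrative, not the WMAgent implementation.

    # Illustrative sketch, not WMAgent code: cap submissions per site and
    # respect a global soft limit on the total number of pending jobs.
    SOFT_LIMIT_TOTAL_PENDING = 20000     # illustrative global soft limit

    # Illustrative per-site thresholds and current pending counts.
    site_thresholds = {"T1_US_FNAL": 5000, "T2_CH_CERN": 3000}
    pending_per_site = {"T1_US_FNAL": 4800, "T2_CH_CERN": 1000}


    def jobs_to_submit(site, candidates):
        """Return how many of `candidates` jobs may be submitted to `site`."""
        total_pending = sum(pending_per_site.values())
        if total_pending >= SOFT_LIMIT_TOTAL_PENDING:
            return 0                                   # global soft limit reached
        site_room = site_thresholds[site] - pending_per_site[site]
        global_room = SOFT_LIMIT_TOTAL_PENDING - total_pending
        return max(0, min(candidates, site_room, global_room))


    print(jobs_to_submit("T1_US_FNAL", 500))   # 200: capped by the site threshold
    print(jobs_to_submit("T2_CH_CERN", 500))   # 500: both limits still have room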

Summary

  • There were several issues that affected the infrastructure and hence the system's performance.
  • We also had some unusual conditions on workflow priority.
  • Furthermore, our ability to control the system was impaired.
  • We are working on preventing and correcting the issues under our control.

AOB

-- JenniferAdelmanMcCarthy - 22 May 2014
