Operational Support Responsibilities


  • Each shifter is responsible for monitoring the system and sending a report to the elog between 3 and 5 in their local time zone every day.
    • This is a change from the previous setup, where people were on shift for a week at a time.
  • The work should take between 30 minutes and a couple of hours, depending on the load on the system.

Sample Shift Report


Components restarted and summary of error messages

Site status

  • The report should include all sites with issues and all sites you are watching, plus links to any tickets opened or closed.
  • Interactive Dashboard view
    • Drill down into the reprocessing and production views and look for sites showing a high failure rate. Also look at the job type for each site: sometimes production can be 100% successful while merge is 100% failing, and the site will show "green" everywhere except on the job type plot!
    • Drill in through WMStats by workflow to determine whether it is a site issue or a workflow issue. If it is a site issue, file a GGUS ticket AND talk to your site support team!
      • If you are unsure, send an email to the site support team (cms-comp-ops-site-support-team at cern.ch) and help them dig!
    • GGUS tickets closed
    • Sites to keep a close eye on because they have been having issues but are not yet at the point of needing a GGUS ticket. If we are watching a site, make sure you email the site support team (cms-comp-ops-site-support-team at cern.ch) so that they watch the site as well. Sometimes issues show up for us as high failure rates on a specific small workflow that are masked by larger successful workflows.
    • If a workflow has a high file read rate, check DAS (https://cmsweb.cern.ch/das/) to see whether the data is 100% on disk, and where. If the problem is persistent, for now open a JIRA ticket for the workflow noting the issues. This should trigger the Workflow Traffic Controller or the L3s to check whether we need to increase replication.
    • If we are seeing workflows with a high incidence of an exit code:
      • 137: the node is running out of memory and the site has killed the job. Send an email to the site support team to investigate.
      • 99109: uncaught exception in the WMAgent step executor (often a stage-out problem).
        • Check site readiness.
        • Check the logs for the failed jobs. Ask the site support team for their opinion.
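
The masking effect described above (a site that looks "green" overall while one job type fails completely) and the exit-code guidance can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layout, the site name T2_Example_A, the job counts, and the helper names failure_rates and triage_hint are hypothetical and are not part of the Dashboard or WMStats.

```python
from collections import defaultdict

# Hypothetical job records: (site, job_type, succeeded).
# Site name and counts are made up for illustration.
jobs = (
    [("T2_Example_A", "Production", True)] * 950
    + [("T2_Example_A", "Merge", False)] * 50
)


def failure_rates(records, key):
    """Return {group: failed / total}, grouping records with key(record)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        if not record[2]:  # third field is the success flag
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


# Per-site view: the Merge failures are diluted by the Production successes.
by_site = failure_rates(jobs, key=lambda r: r[0])

# Per-site, per-job-type view: the failing Merge jobs stand out.
by_site_and_type = failure_rates(jobs, key=lambda r: (r[0], r[1]))

# Exit codes and follow-up actions, paraphrased from the list above.
EXIT_CODE_HINTS = {
    137: "out of memory, job killed by site: email the site support team",
    99109: "uncaught exception in WMAgent step executor (often stage-out): "
           "check site readiness and the failed job logs",
}


def triage_hint(exit_code):
    """Look up the suggested action; unknown codes get a generic fallback."""
    return EXIT_CODE_HINTS.get(exit_code, "not listed here: check the job logs")


print(by_site)
print(by_site_and_type)
print(triage_hint(137))
```

In this made-up sample the site-level failure rate is only 5%, while the (site, Merge) group fails 100% of the time, which is why the job type plot is worth checking even when the site view looks healthy.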

Workflow issues

  • In general, you can ignore any workflow with backfill, test or RelVal in its name.
  • Check WMStats for any workflows with a high failure rate.
    • Drill in: is it a site issue or a workflow issue? Are the failures just due to tails, so that we are down to jobs that will time out anyway? Is the workflow just ramping up, so that we are seeing the "infant mortality" of the first lumis that will fail anyway?
    • Summary of workflows having problems, with links to elogs if already posted
    • List of workflows on which actions have been taken, the actions taken, and links to elogs
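
The first two checks above can be sketched as a small Python filter. This is only an illustration under stated assumptions: the 20% threshold, the case-insensitive substring matching, the workflow names, and the function needs_attention are all hypothetical; the text above does not define what counts as a "high" failure rate.

```python
# Name fragments that mark a workflow as ignorable, per the guidance above.
# Matching case-insensitively is an assumption of this sketch.
IGNORED_NAME_PARTS = ("backfill", "test", "relval")


def needs_attention(workflow_name, failure_rate, threshold=0.2):
    """Return True if a workflow should be looked at during the shift.

    threshold is a hypothetical cut-off for a "high" failure rate.
    """
    lowered = workflow_name.lower()
    if any(part in lowered for part in IGNORED_NAME_PARTS):
        return False
    return failure_rate >= threshold


# Made-up workflow names and failure rates.
sample = {
    "example_RelVal_workflow": 0.90,
    "example_Backfill_task": 0.75,
    "example_production_task": 0.40,
    "example_quiet_task": 0.01,
}

flagged = [name for name, rate in sample.items() if needs_attention(name, rate)]
print(flagged)
```

Plain substring matching is deliberately crude: a name that merely contains "test" inside another word would also be skipped, so a real filter would need to follow the actual workflow naming convention.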

Other issues noted during the shift

Topic revision: r9 - 2016-09-23 - JeanrochVlimant