UNDER CONSTRUCTION, INSTRUCTIONS MIGHT CHANGE

Operational Support Responsibilities


Introduction

  • Each shifter is responsible for monitoring the system and sending a report to the elog between 3 and 5 in their local time zone every day.
    • This is a change from the previous setup, where people were on shift for a week at a time.
  • The work should take between 30 minutes and a couple of hours, depending on the load on the system.

Sample Shift Report

Agents

Components restarted and summary of error messages

Site status / Workflow

  • The report should include all sites with issues and all sites you are watching, plus links to any tickets opened or closed.
  • Interactive Dashboard view
    • Drill down into the reprocessing and production views and look for sites showing a high failure rate. Also look at the job type plot for each site: sometimes production is 100% successful while merge is 100% failing, and the site will show "green" everywhere except on the job type plot!
    • Drill in through WMStats by workflow to determine whether it is a site issue or a workflow issue. If it is a site issue, file a GGUS ticket AND talk to your site support team!
      • If you are unsure, send an email to the site support team (cms-comp-ops-site-support-team at cern.ch) and help them dig!
    • GGUS tickets closed
    • Sites to keep a close eye on because they have been having issues but are not yet at the point of needing a GGUS ticket. If we are watching a site, make sure you email the site support team (cms-comp-ops-site-support-team at cern.ch) so that they watch the site as well. Sometimes issues show up for us as high failure rates on a specific small workflow that are masked by larger successful workflows.
    • Error Code Mapping
      • We should be working from the same list as the WTC for error code tracking. If we catch these codes while workflows are running, open GGUS tickets and email site support if the problem is site related. If it is workflow related, JIRA tickets should be opened.
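
The masking effect described above (a site that looks "green" overall while one job type fails completely) can be illustrated with a small sketch. This is not how the Dashboard or WMStats compute their plots; the function name, the tuple layout and the site name T2_XX_Example are all assumptions made up for illustration.

```python
from collections import defaultdict

def failure_rates(jobs):
    """Return (per-site, per-(site, job type)) failure rates.

    `jobs` is an iterable of (site, job_type, succeeded) tuples.
    This layout is hypothetical, not a real Dashboard/WMStats format.
    """
    per_site = defaultdict(lambda: [0, 0])       # site -> [failed, total]
    per_site_type = defaultdict(lambda: [0, 0])  # (site, type) -> [failed, total]
    for site, job_type, succeeded in jobs:
        for counter in (per_site[site], per_site_type[(site, job_type)]):
            counter[1] += 1
            if not succeeded:
                counter[0] += 1
    site_rates = {key: failed / total for key, (failed, total) in per_site.items()}
    type_rates = {key: failed / total for key, (failed, total) in per_site_type.items()}
    return site_rates, type_rates

# Made-up sample: 95 successful Production jobs and 5 failed Merge jobs.
jobs = ([("T2_XX_Example", "Production", True)] * 95
        + [("T2_XX_Example", "Merge", False)] * 5)
site_rates, type_rates = failure_rates(jobs)
```

In this made-up sample the site as a whole fails only 5% of its jobs, yet every Merge job fails, which is why the job type plot is worth checking separately.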

Workflow issues

  • In general you can ignore any workflow with backfill, test or RelVal in its name.
  • Ignore all workflows in status abort; you will see high error rates in the dashboard while a workflow is aborting.
  • Check WMStats for any workflows with a high failure rate.
    • Drill in: is it a site issue? A workflow issue? Are the failures just due to tails, where we are down to jobs that will time out, etc.? Is the workflow just ramping up and seeing "infant mortality" of the first lumis that will fail anyway?
    • Summary of workflows having problems, with links to elogs if already posted
    • List of workflows on which actions have been taken and the actions taken, with links to elogs
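
The ignore rules above can be sketched as a simple filter over a list of workflows. Everything here is an assumption for illustration: the function name, the example workflow names and status strings, the case-insensitive name matching, and especially the 20% threshold (the text only says "high failure rate" without giving a number).

```python
# Name fragments to skip, per the rules above (matched case-insensitively,
# which is an assumption of this sketch).
IGNORED_NAME_TOKENS = ("backfill", "test", "relval")

def needs_attention(name, status, failure_rate, threshold=0.2):
    """Return True if a workflow deserves a closer look.

    `threshold` is a hypothetical cut-off chosen for this sketch only.
    """
    lowered = name.lower()
    if any(token in lowered for token in IGNORED_NAME_TOKENS):
        return False
    # Assumes abort-like statuses start with "abort".
    if status.lower().startswith("abort"):
        return False
    return failure_rate >= threshold

# Made-up examples: (workflow name, status, failure rate)
workflows = [
    ("example_ReReco_Run2016", "running-open", 0.35),
    ("example_Backfill_check", "running-open", 0.90),
    ("example_RelVal_TTbar", "running-closed", 0.50),
    ("example_MC_prod", "aborted", 0.80),
    ("example_MC_healthy", "running-open", 0.02),
]
flagged = [name for name, status, rate in workflows
           if needs_attention(name, status, rate)]
```

Only example_ReReco_Run2016 is flagged here. A flagged workflow still needs the manual drill-down described above, since tails and ramp-up failures can also push the rate up; note too that plain substring matching on "test" would skip any name that merely contains those letters.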

Other issues noted during the shift

Topic revision: r10 - 2016-09-30 - JenniferAdelmanMcCarthy
 