Dashboard Monitoring Scripts

General

This page lists the scripts written to monitor Workflow jobs and SiteReadiness information in Dashboard.

  • Each script entry gives its location on AFS and GitHub, and the accompanying metrics in Dashboard.
  • The scripts are run by acrontab on the cmst1 account.
  • In general, all monitoring scripts can be found in the CMSCompOps/MonitoringScripts repository on GitHub (https://github.com/CMSCompOps/MonitoringScripts).
  • Every script comes with a Readme.rm containing extra documentation.
  • E-group for info and alarms: <cms-comp-ops-monitoring-wf-ss-teams@cern.ch>

WFM - Workflow Monitoring of # pending/running jobs

Monitors the number of pending / running jobs per site. The count includes all Production, Processing, Merge, Log and Clean jobs.
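
The per-site counting described above can be sketched as follows. This is a minimal illustration and not the code in the repository: the record layout (the keys `site`, `type` and `state`), the site names and the function name `count_jobs_per_site` are assumptions made for the example.

```python
from collections import defaultdict

# Job types included in the count, as listed on this page.
COUNTED_TYPES = {"Production", "Processing", "Merge", "Log", "Clean"}


def count_jobs_per_site(jobs):
    """Return {site: {"pending": n, "running": m}} for the counted job types.

    `jobs` is an iterable of dicts with the hypothetical keys
    "site", "type" and "state" ("pending" or "running").
    """
    counts = defaultdict(lambda: {"pending": 0, "running": 0})
    for job in jobs:
        if job["type"] not in COUNTED_TYPES:
            continue  # job type is not part of the metric
        if job["state"] in ("pending", "running"):
            counts[job["site"]][job["state"]] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical sample records; site names are placeholders.
    sample = [
        {"site": "T1_XX_SiteA", "type": "Production", "state": "running"},
        {"site": "T1_XX_SiteA", "type": "Merge", "state": "pending"},
        {"site": "T2_XX_SiteB", "type": "Log", "state": "running"},
    ]
    print(count_jobs_per_site(sample))
```

Grouping by a scheduler key instead of `site` would give the per-scheduler variant of the same count.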

location
GitHub https://github.com/CMSCompOps/MonitoringScripts/tree/master/WFM_Input_DashBoard (the loose files)
Afs /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/runWFMonDBShort.sh
Dashboard metrics Metric 137, 138 : Pending and running jobs columns in production view
cmst1 cronjob 5,20,35,50 * * * * lxplus ssh vocms202 /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/runWFMonDBShort.sh &> /dev/null
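
The cronjob entry above follows the acrontab layout: five time fields, then the host on which the job starts, then the command. As a rough sketch of how such an entry breaks down, assuming that layout and supporting only `*` or a comma-separated list in the minute field (the helper names `parse_acrontab_entry` and `minutes_of` are made up for this example):

```python
def parse_acrontab_entry(line):
    """Split one acrontab line into its schedule, host and command.

    Assumed layout: minute hour day-of-month month day-of-week host command.
    """
    fields = line.split(None, 6)  # the 7th field keeps the rest of the line
    if len(fields) != 7:
        raise ValueError("expected 5 time fields, a host and a command")
    minute, hour, day, month, weekday, host, command = fields
    return {
        "minute": minute,
        "hour": hour,
        "day": day,
        "month": month,
        "weekday": weekday,
        "host": host,
        "command": command,
    }


def minutes_of(field):
    """Expand a minute field that is either '*' or a comma-separated list."""
    if field == "*":
        return list(range(60))
    return [int(m) for m in field.split(",")]


if __name__ == "__main__":
    entry = parse_acrontab_entry(
        "5,20,35,50 * * * * lxplus ssh vocms202 "
        "/afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/"
        "runWFMonDBShort.sh &> /dev/null"
    )
    print(entry["host"], minutes_of(entry["minute"]))
```

Read this way, the job starts on lxplus at minutes 5, 20, 35 and 50 of every hour, and the command itself opens an ssh session to vocms202 to run the script.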

WFM - Workflow Monitoring of # jobs per vobox/scheduler

Monitoring the number of pending / running jobs per scheduler, instead of per site as above.

location
GitHub https://github.com/CMSCompOps/MonitoringScripts/tree/master/WFM_Input_DashBoard/WFMperVOBox
Afs /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/WFMperVOBox/runWFMonDBShort_voboxes.sh
Dashboard metrics Metric 137, 138 : Pending and running jobs columns in production view (bottom of page)
cmst1 cronjob 5,20,35,50 * * * * lxplus ssh vocms202 /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/WFMperVOBox/runWFMonDBShort_voboxes.sh &> /dev/null

WFM - Workflow Monitoring Alarms

Alarms based on the number of pending / running jobs on a site, the pledge of the site and the status of the site (Down, Skip, ...). They give a first indication of whether something is wrong with production on the site side.

location
GitHub https://github.com/CMSCompOps/MonitoringScripts/blob/master/WFM_Input_DashBoard/WFM_Alarms
Afs /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/WFM_Alarms/runWFMonDB_Alarms.sh
Dashboard metrics Metric 141, 142, 143, 144 : GlideIn alarm, 8h alarm, Instant alarm and 24h alarm column
cmst1 cronjob 0,15,30,45 * * * * vocms174 /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/WFM_Alarms/runWFMonDB_Alarms.sh &> /dev/null

WFM - Workflow Job Success

A script that calculates the total job success of a workflow. The grid-proc success compares jobs that are done with jobs that were aborted. The app-proc success compares app-succeeded jobs with app-failed jobs.
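
One plausible reading of these two figures is a success percentage: succeeded jobs divided by the sum of succeeded and failed jobs. The sketch below assumes that reading. The formula, the function names and the choice to return `None` when there are no jobs are assumptions for illustration, not taken from the script itself.

```python
def success_rate(succeeded, failed):
    """Percentage of succeeded jobs, or None when there were no jobs at all."""
    total = succeeded + failed
    if total == 0:
        return None  # no jobs, so no meaningful rate
    return 100.0 * succeeded / total


def job_success(done, aborted, app_succeeded, app_failed):
    """Return both success figures for one workflow (assumed definitions)."""
    return {
        # grid-proc: done versus aborted jobs
        "grid_proc": success_rate(done, aborted),
        # app-proc: app-succeeded versus app-failed jobs
        "app_proc": success_rate(app_succeeded, app_failed),
    }


if __name__ == "__main__":
    # Hypothetical job counts for a single workflow.
    print(job_success(done=90, aborted=10, app_succeeded=45, app_failed=5))
```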

location
GitHub https://github.com/CMSCompOps/MonitoringScripts/tree/master/WFM_Input_DashBoard/JobSuccess
Afs /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/JobSuccess/run_jobSuccess.sh
Dashboard metrics Metric 148, 149 : Avg 24h grid/ app proc jobSuccess
cmst1 cronjob 5,20,35,50 * * * * lxplus ssh vocms202 /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/JobSuccess/run_jobSuccess.sh &> /dev/null

WFM - Avg # jobs per 24h

Script that calculates the average number of pending / running jobs per site over the last 24 hours.
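
A minimal sketch of such an average, assuming the input is a list of timestamped job counts for one site (for example, one sample per cron run). The tuple layout and the function name `average_last_24h` are hypothetical.

```python
from datetime import datetime, timedelta


def average_last_24h(samples, now):
    """Average pending and running counts over the 24 hours before `now`.

    `samples` is a list of (timestamp, pending, running) tuples for one site.
    Returns (avg_pending, avg_running), or None if no sample is recent enough.
    """
    cutoff = now - timedelta(hours=24)
    recent = [(p, r) for (t, p, r) in samples if cutoff < t <= now]
    if not recent:
        return None
    n = len(recent)
    avg_pending = sum(p for p, _ in recent) / n
    avg_running = sum(r for _, r in recent) / n
    return avg_pending, avg_running


if __name__ == "__main__":
    now = datetime(2014, 2, 19, 12, 0)
    # Hypothetical samples; the first one is older than 24 hours and is ignored.
    samples = [
        (now - timedelta(hours=25), 100, 100),
        (now - timedelta(hours=2), 30, 40),
        (now - timedelta(hours=1), 10, 20),
    ]
    print(average_last_24h(samples, now))
```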

location
GitHub https://github.com/CMSCompOps/MonitoringScripts/tree/master/WFM_Input_DashBoard/Avg24hjobs
Afs /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/Avg24hjobs/run_avg24hjobs.sh
Dashboard metrics Metric 146, 147 : Avg 24h Pending / Running jobs
cmst1 cronjob 5,20,35,50 * * * * lxplus ssh vocms202 /afs/cern.ch/user/c/cmst1/scratch0/WFM_Input_DashBoard/Avg24hjobs/run_avg24hjobs.sh &> /dev/null

SiteReadiness - 1W&3M(>60%)

Script that calculates the site readiness of a site. A site is listed as red when its site readiness is below 60% for both the last week and the last 3 months.
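
The rule can be written down as a short sketch. It assumes the two readiness values are already available as percentages. The function name `is_red` and the treatment of exactly 60% as not red are assumptions for illustration.

```python
THRESHOLD = 60.0  # percent, taken from the metric name "1W&3M(>60%)"


def is_red(readiness_1w, readiness_3m, threshold=THRESHOLD):
    """True when readiness is below the threshold for BOTH periods.

    `readiness_1w` and `readiness_3m` are percentages for the last week
    and the last 3 months respectively.
    """
    return readiness_1w < threshold and readiness_3m < threshold


if __name__ == "__main__":
    print(is_red(50.0, 55.0))  # both periods below 60%
    print(is_red(50.0, 70.0))  # only the last week is below 60%
```

Under this rule, a site with a poor last week but a good 3-month record is not listed red, and neither is a site that has recovered in the last week after three poor months.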

location
GitHub https://github.com/CMSCompOps/MonitoringScripts/blob/master/SiteReadiness_Dashboard
Afs /afs/cern.ch/user/c/cmst1/scratch0/SiteReadiness_Dashboard/run_badSites_SiteReadiness.sh
Dashboard metrics Metric 152 : SiteReadiness 1W&3M(>60%)
cmst1 cronjob 5,20,35,50 * * * * lxplus ssh vocms202 /afs/cern.ch/user/c/cmst1/scratch0/SiteReadiness_Dashboard/run_badSites_SiteReadiness.sh &> /dev/null

Waitingroom - Sites in waitingroom last day

A script that shows daily which sites are in the waitingroom.

location
GitHub https://github.com/CMSCompOps/MonitoringScripts/tree/master/Waitingroom_Dashboard
Afs /afs/cern.ch/user/c/cmst1/scratch0/Waitingroom_Dashboard/source_The_Run_File_WaitingRoom_Sites.sh
Dashboard metrics Metric 153 : Waitingroom last day
cmst1 cronjob 5,20,35,50 * * * * lxplus ssh vocms202 /afs/cern.ch/user/c/cmst1/scratch0/Waitingroom_Dashboard/source_The_Run_File_WaitingRoom_Sites.sh &> /dev/null

Waitingroom - Sites in waitingroom last 1/2/3 months

Script that checks if a site was in the waitingroom for the last 1/2/3 months.

location
GitHub https://github.com/CMSCompOps/MonitoringScripts/tree/master/Waitingroom_Dashboard/Waitingroom_SummedMetric
Afs /afs/cern.ch/user/c/cmst1/scratch0/Waitingroom_Dashboard/Waitingroom_SummedMetric/run_WaitingRoom_Sites.sh
Dashboard metrics Metric 154, 155, 156 : Waitingroom last 1/2/3 months
cmst1 cronjob 5,20,35,50 * * * * lxplus ssh vocms202 /afs/cern.ch/user/c/cmst1/scratch0/Waitingroom_Dashboard/Waitingroom_SummedMetric/run_WaitingRoom_Sites.sh &> /dev/null
Topic revision: r6 - 2014-02-19 - XavierJanssen