Tier0 Monitoring and alerting

Grafana monitoring overview

KibanaTier0MonitoringOverview

Include new scripts in the monitoring

The monitoring scripts are stored on each agent machine. You can check the list of the current T0 machines on the T0 Replay and Production Account twiki. The scripts are located under /data/tier0/sls and keep the following structure:

   Folder         Description
   scripts        Location of the monitoring scripts
   scripts/Logs   Standard and error output of the monitoring scripts
   data           Output of the monitoring scripts

If you want to run a monitoring script as a cronjob, configure it in the cmst1 user's crontab. As lxplus will be retired in the short/mid term, using acrontab is discouraged.

To list the current cron entries, use the -l flag:

   $ crontab -l
To edit the current cron entries, use the -e flag. If the command leads you to a blank screen or does not show you anything, set the EDITOR environment variable to the editor you want to use for editing the crontab:
   $ export EDITOR=vim
   $ crontab -e
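If you prefer not to open an editor, an entry can also be appended non-interactively. A minimal sketch, assuming a hypothetical script path (my_new_check.sh does not exist; substitute your own script):

```shell
#!/bin/bash
# NEW_JOB is an illustrative crontab entry, not an existing script.
NEW_JOB='*/5 * * * * /data/tier0/sls/scripts/my_new_check.sh > /dev/null'

# Sanity check: a crontab line needs five schedule fields plus a command.
fields=$(echo "$NEW_JOB" | awk '{print NF}')
[ "$fields" -ge 6 ] && echo "entry looks well-formed"

# Install it (commented out here so the sketch does not touch a real crontab):
# (crontab -l 2>/dev/null; echo "$NEW_JOB") | crontab -
```

The append-and-reinstall pattern on the last line keeps the existing entries, since `crontab -` replaces the whole crontab with what it reads from stdin.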

Example: This is one of the remaining acrontab entries that executes the main shell script:

   # Tier0 Monitoring - Before any change, please contact: cms-tier0-operations@cern.ch
   */5 * * * * lxplus ssh vocms015 "/data/tier0/sls/scripts/runSLSAlarms.sh > /dev/null"

Example: This is the crontab entry to run the monitoring script on a replay machine:

   # Tier-0 - Monitoring for replays on this instance - (Responsible: Tier0-Ops <cms-tier0-operations@cern.ch>)
   */5 * * * * /data/tier0/sls/scripts/cmst0_replay_monitoring.sh &> /data/tier0/sls/scripts/cmst0_replay_monitoring.log

NOTE: If you add a new cronjob, please add a comment stating what the job does and the responsible contact, as in the examples above.

Create a new page

  • You need to create a new file with the json configuration of the page (obtained in Grafana GUI) and save it here:
    /afs/cern.ch/cms/monitoring/CMS/

Monitoring specification

The monitoring scripts (and some other utilities for operations) are available at the monitoring and alerting repository on Gitlab.

M1 Paused jobs

   Description   Get the number of paused jobs per workflow type (Express, Repack, PromptReco)
   Source        Production T0AST
   Query used

   SELECT cache_dir FROM wmbs_job WHERE state = (SELECT id FROM WMBS_JOB_STATE WHERE name = 'jobpaused') AND cache_dir LIKE '%Express%';
   SELECT cache_dir FROM wmbs_job WHERE state = (SELECT id FROM WMBS_JOB_STATE WHERE name = 'jobpaused') AND cache_dir LIKE '%Repack%';
   SELECT cache_dir FROM wmbs_job WHERE state = (SELECT id FROM WMBS_JOB_STATE WHERE name = 'jobpaused') AND cache_dir LIKE '%PromptReco%';
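The cache_dir paths returned by these queries encode the workflow type, so the per-type tally can be sketched in shell. The paths below are mock data; in production the list would come from running the queries against T0AST:

```shell
#!/bin/bash
# Mock cache_dir values standing in for the output of the M1 queries.
cache_dirs="/data/tier0/jobCache/Express_Run123/job_1
/data/tier0/jobCache/Repack_Run123/job_2
/data/tier0/jobCache/Express_Run124/job_3
/data/tier0/jobCache/PromptReco_Run120/job_4"

# Count paused jobs per workflow type by matching the path substring,
# mirroring the LIKE '%...%' clauses in the SQL above.
for wf in Express Repack PromptReco; do
    count=$(echo "$cache_dirs" | grep -c "$wf")
    echo "$wf: $count paused"
done
```

With the mock data this prints 2 paused Express jobs and 1 each for Repack and PromptReco.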

M2 Use of areas of interest on EOS

   Description   Get the (i) total quota, (ii) used quota, and (iii) percentage of use of the monitored EOS areas
   Source        eos client*
   Query used

Base command:

   eos quota

Example**:

   $(eos quota | grep -A 4 " /eos/cms/store/unmerged/" | tail -1)
(*) The eos client is available on lxplus but not on the Tier0 voboxes. Installing it is not straightforward, as there are package conflicts, and updated cmst1 certificates are also required.

(**) The monitored areas are:

   Area                         Usage                                                         User performing cleanup   Cleanup method
   /eos/cms/store/t0streamer/   The Storage Manager uploads the input streamer files here     cmsprod                   Cronjob:
   /eos/cms/store/unmerged/     Job output is stored here until merge jobs are executed       phedex                    Cronjob:
   /eos/cms/store/express/      Output of the Express workflows                               phedex                    Cronjob:
   /eos/cms/tier0/              Output of Repack and PromptReco workflows                     T0 WMAgent                PhEDExInjector creates deletion requests
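The quota line extracted above can feed a threshold check like the one behind Alert004 (95% of quota). A minimal sketch, assuming a mock quota line and assuming the percentage of use is the last whitespace-separated field (real eos quota output may place it differently):

```shell
#!/bin/bash
# Mock line standing in for: eos quota | grep -A 4 " /eos/cms/store/unmerged/" | tail -1
quota_line="/eos/cms/store/unmerged/  890.00 TB  1.00 PB  96"

# Assumption: the usage percentage is the last field of the line.
pct=$(echo "$quota_line" | awk '{print $NF}')

threshold=95
if [ "$pct" -ge "$threshold" ]; then
    echo "WARNING: /eos/cms/store/unmerged/ at ${pct}% (threshold ${threshold}%)"
fi
```

Against the mock line this prints the warning, since 96% exceeds the 95% threshold.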

Monitoring

   Monitoring                            Description
   Node summary page                     Short summary, accessible from anywhere, of activity on all active T0 nodes (production, replay)
   Active runs page                      List of active runs and possible reasons why those runs are not closed
   OMS Run Summary                       Details of a given run: start and end times, streams, PDs, lumisections, etc.
   ConfDB                                HLT menus and their contents
   Storage Manager                       Runs in the Storage Manager, status of transfers, Tier0 check and Repack
   WMStats                               Production WMAgents; information about runs, workflows, paused jobs, crashed components, etc.
   General Tier0 production monitoring   Runs divided by numbers; progress of the Express, Repack and PromptReco processes
   Tier0 plots                           Transfer backlog, data volume, etc., summarized in plots
   GlideInWMS Monitoring                 GlideInWMS for production
   Tier0 Paused jobs details             Details about the paused jobs, using Tier0 production monitoring
   Tier0 Production monitoring           Use of EOS areas, agent backlog
   vocms047 Replay Instance              Running, idle and paused jobs, and fileset information
   vocms015 Replay Instance              Running, idle and paused jobs, and fileset information
   vocms0500 Replay Instance             Running, idle and paused jobs, and fileset information

Alerting

   ID         Description                                            Watchdog
   Alert001   The number of paused jobs changes                      Cronjob on vocms0313/4, cmst1 crontab
   Alert002   A component in the WMAgent crashes                     Cronjob on vocms0313/4, cmst1 crontab
   Alert003   Created blocks are not cleaned up after a week         Cronjob on vocms0313/4, cmst1 crontab
   Alert004   The use of the areas of interest on EOS reaches 95%    Cronjob on lxplus, cmst1 acrontab
   Alert005   The transfer system goes down                          Cronjob on vocms001, cmst1 crontab
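Alert001 fires when the paused-job count changes between watchdog runs. The change detection can be sketched as comparing the current count against the last value recorded in a state file (the file path and counts below are illustrative; in production the count would come from the M1 queries):

```shell
#!/bin/bash
# Illustrative state file and counts.
state_file=$(mktemp)
echo 3 > "$state_file"   # pretend the last recorded count was 3
current=7                # pretend the current count is 7

last=$(cat "$state_file")
if [ "$current" -ne "$last" ]; then
    echo "ALERT: paused job count changed from $last to $current"
fi
echo "$current" > "$state_file"   # remember the count for the next run
```

On the next cron invocation the stored value equals the current one, so no alert is raised until the count changes again.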
Topic revision: r20 - 2019-10-01 - VytautasJankauskas