Tier0 Monitoring

CompOpsMeeting plots

The plots shown in the CompOps meeting are queried from Dashboard by a cronjob that is installed in the cmst2 acrontab on lxplus.

# Tier0 CompOpsMeeting plots - cms-tier0-operations@cern.ch
0 0 * * * vocms049 `pwd`/tier0/tier0_resources_plots.sh

This script downloads the plots and leaves them in a permanent location that is statically referenced from the meeting twiki. This means that no changes are required in the Twiki from week to week: as long as the plots are produced correctly, the latest plots will always be shown.

If the processing site is updated, this script has to be updated to point to the new one.

Kibana

Plots URLs

| Monitoring | URL | Source |
| Tier0 Production monitoring | https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::tier0 | /afs/cern.ch/cms/monitoring/CMS/tier0.json |
| vocms047 Replay Instance | https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::tier0_047 | /afs/cern.ch/cms/monitoring/CMS/tier0_047.json |
| vocms015 Replay Instance | https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::tier0_015 | /afs/cern.ch/cms/monitoring/CMS/tier0_015.json |
| vocms001 Replay Instance | https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::tier0_001 | /afs/cern.ch/cms/monitoring/CMS/tier0_001.json |

The scripts are run from the cmst1 acrontab on lxplus.

Adding a new monitoring page with a readable URL

Note that the last part of each URL in the previous table is the name of the corresponding file in the cmst1 monitoring folder, without the extension. For example, https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::tier0_047 corresponds to /afs/cern.ch/cms/monitoring/CMS/tier0_047.json.
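
The naming rule above can be sketched as a small shell helper. This is only an illustration: the function name dashboard_url and the variable KIBANA_BASE are hypothetical and not part of any Tier0 script. It assumes that the URL is always the common Kibana prefix followed by the JSON file name without its extension.

```shell
# Common prefix of the dashboard URLs listed in the table.
KIBANA_BASE='https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::'

# dashboard_url JSON_FILE (hypothetical helper)
# Prints the dashboard URL that corresponds to a JSON definition file.
dashboard_url() {
    # basename strips the directory and the .json suffix; the file need not exist.
    name=$(basename "$1" .json)
    printf '%s%s\n' "$KIBANA_BASE" "$name"
}

dashboard_url /afs/cern.ch/cms/monitoring/CMS/tier0_047.json
```

Running the helper on a new file name before publishing it is a quick way to see which URL the page would be expected to get under this assumption.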

...
#Tier0 SLS Alarms - luis89@fnal.gov, modified by jbadillo@cern.ch, modified by jocasall@cern.ch
*/5 * * * * lxplus ssh vocms015 "/data/tier0/sls/scripts/runSLSAlarms.sh > /dev/null"
*/5 * * * * lxplus ssh vocms015 "/data/tier0/sls/scripts/defragmentationTestingMonitoring.sh > /dev/null"
*/5 * * * * lxplus ssh vocms001 "/data/tier0/sls_vocms001/scripts/cmst0_replay_monitoring.sh > /dev/null"
*/5 * * * * lxplus ssh vocms047 "/data/tier0/sls/scripts/cmst0_replay_monitoring.sh > /dev/null"
*/5 * * * * lxplus ssh vocms015 "/data/tier0/sls_vocms015/scripts/cmst0_replay_monitoring.sh > /dev/null"
*/10 * * * * lxplus ssh vocms0313 "/data/tier0/jocasall/croneDiagnosis.sh > /dev/null"
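
For readers unfamiliar with acrontab entries: unlike a plain crontab line, each entry has a host field between the five schedule fields and the command. The sketch below splits one entry into those parts. The function name split_acron_entry is hypothetical and exists only for illustration; it assumes fields are separated by whitespace, as in the entries above.

```shell
# split_acron_entry LINE (hypothetical helper)
# Prints the schedule, the host and the command of one acrontab entry.
split_acron_entry() {
    printf '%s\n' "$1" | {
        # Fields 1-5 are the schedule, field 6 is the host that runs the job,
        # and the remainder of the line is the command.
        read -r min hour dom mon dow host cmd
        printf 'schedule: %s %s %s %s %s\n' "$min" "$hour" "$dom" "$mon" "$dow"
        printf 'host:     %s\n' "$host"
        printf 'command:  %s\n' "$cmd"
    }
}

split_acron_entry '*/5 * * * * lxplus ssh vocms015 "/data/tier0/sls/scripts/runSLSAlarms.sh > /dev/null"'
```

In the entries above the host is always lxplus, and the command then uses ssh to reach the node where the monitoring script actually lives.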

Replays Monitoring Overview

There is one monitoring page per replay instance in the Tier0. The URL format is:

https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::tier0_<node_number>

Each of these pages shows six different plots.

Using Lemon

Command to check values on the Lemon server

lemon-cli --metric="13107 13108" --nodes="CMST0-paused-jobs CMST0-wma-backlog CMST0-check-eos CMST0-late-workflows CMST0-long-jobs" --server

Check XML files consistency

xmllint --noout --schema http://itmon.web.cern.ch/itmon/files/xsls_schema.xsd /path/to/your.xml
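
When several XML files need checking, the same command can be wrapped in a loop. The following is a minimal sketch, not an existing Tier0 script: the function name validate_xml_dir and the XMLLINT override variable are assumptions made for this example, and the directory passed to it is whatever location holds the generated XML files.

```shell
# Schema URL taken from the xmllint command above.
SCHEMA_URL="http://itmon.web.cern.ch/itmon/files/xsls_schema.xsd"
# Validator binary; the override is an assumption, useful for dry runs.
XMLLINT="${XMLLINT:-xmllint}"

# validate_xml_dir DIR (hypothetical helper)
# Runs the schema check on every *.xml file in DIR.
# Returns 0 if every file validates, 1 if at least one fails.
validate_xml_dir() {
    dir="$1"
    status=0
    for file in "$dir"/*.xml; do
        # If DIR has no XML files the pattern stays unexpanded; skip it.
        [ -e "$file" ] || continue
        if ! "$XMLLINT" --noout --schema "$SCHEMA_URL" "$file"; then
            echo "INVALID: $file" >&2
            status=1
        fi
    done
    return "$status"
}
```

A call such as validate_xml_dir /path/to/xml/dir checks every file in that directory and reports the failing ones on standard error.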

Check the Lemon log (on vocms047, where the XML files are placed)

tail /var/log/lemon-agent.log -f

Monitoring of the Tier0

As part of monitoring the Tier0, there are several sites and systems that we usually check.

Current configuration

(last updated: March 2016)

General Overview

[Figure: g8009.png, Kibana Overview]

Scripts

| Script | Location | OutputFile |
| cmst0-wma-backlog | :/data/tier0/sls/scripts/ | cmst0_backlog_wma.xml |
| cmst0_late_workflows | :/data/tier0/sls/scripts/ | cmst0_late_workflows.xml |
| cmst0_long_jobs | :/data/tier0/sls/scripts/ | cmst0_long_jobs.xml |
| cmst0_paused_jobs | :/data/tier0/sls/scripts/ | cmst0_paused_jobs.xml |
| cmst0_eos_check_v2 | :/afs/cern.ch/user/j/jocasall/public/ | cmst0_eos_check.xml |
| cmst0_eos_check_v2_t0streamer | :/afs/cern.ch/user/j/jocasall/public/ | cmst0_eos_check_t0Streamer.xml |

The monitoring scripts produce XML files that are then uploaded to ElasticSearch. Kibana queries them from there to produce the plots.

XML output files

Each script has one or several outputs. The outputs and the values each one contains are described in the following list:

cmst0-late-workflows-detailed
  • numExpress: Number of Express jobs paused
  • numRepack: Number of Repack jobs paused
  • numPromptReco: Number of PromptReco jobs paused
cmst0-long-jobs
  • longJobs:
  • longJobsExpress:
  • longJobsRepack:
  • longJobsPromptReco:
  • longJobsLowThreshold:
  • longJobsMidThreshold:
  • longJobsHighThreshold:
cmst0-paused-jobs
  • pausedJobs:
  • pausedJobsExpress:
  • pausedJobsRepack:
  • pausedJobsPromptReco:
cmst0-check-eos-t0Streamer
  • t0StreamerTotal:
  • t0StreamerUsedSpace:
  • t0StreamerAvailableSpace:
  • t0StreamerPercentage:
cmst0-check-eos
  • unmergedTotal:
  • unmergedUsedSpace:
  • unmergedAvailableSpace:
  • unmergedPercentage:
cmst0-wma-backlog
  • promptrecoCount:
  • repackCount:
  • expressCount:

-- JohnHarveyCasallasLeon - 2015-04-21

Topic attachments

g8009.png (Kibana Overview), r1, 28.5 K, 2015-07-02, JohnHarveyCasallasLeon
Topic revision: r10 - 2016-04-27 - JohnHarveyCasallasLeon
 