Tier0 Monitoring
The plots shown in the CompOps meeting are queried from Dashboard by a cronjob that is installed in the cmst2 acrontab on lxplus.
# Tier0 CompOpsMeeting plots - cms-tier0-operations@cern.ch
0 0 * * * vocms049 `pwd`/tier0/tier0_resources_plots.sh
This script downloads the plots and leaves them in a permanent location that is statically referenced from the meeting Twiki. As a result, no weekly changes to the Twiki are required: as long as the plots are produced correctly, the latest ones are always shown.
If the processing site is updated, this script has to be updated to point to the new one.
Kibana
Plots URLs
The scripts are run by the cmst1 acrontab on lxplus.
Adding a new monitoring page with a readable URL
Note that the last part of the URL in the previous table corresponds to the name of the file in the cmst1 monitoring folder, without the extension. For example:
| https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::tier0_047 | /afs/cern.ch/cms/monitoring/CMS/tier0_047.json |
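The naming rule above can be sketched as a small helper. This is only an illustration: it assumes the URL prefix is the one shown in the example, and the function name `dashboard_url` is hypothetical, not part of any Tier0 tooling.

```python
from pathlib import PurePosixPath

# URL prefix taken from the example above.
KIBANA_PREFIX = "https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::"


def dashboard_url(json_path: str) -> str:
    """Return the readable URL for a dashboard file (hypothetical helper).

    The last part of the URL is the file name without its extension.
    """
    return KIBANA_PREFIX + PurePosixPath(json_path).stem


print(dashboard_url("/afs/cern.ch/cms/monitoring/CMS/tier0_047.json"))
```

Going from file name to URL, rather than the other way round, reflects that the file in the monitoring folder is what determines the readable address.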
...
#Tier0 SLS Alarms - luis89@fnal.gov, modified by jbadillo@cern.ch, modified by jocasall@cern.ch
*/5 * * * * lxplus ssh vocms015 "/data/tier0/sls/scripts/runSLSAlarms.sh > /dev/null"
*/5 * * * * lxplus ssh vocms015 "/data/tier0/sls/scripts/defragmentationTestingMonitoring.sh > /dev/null"
*/5 * * * * lxplus ssh vocms001 "/data/tier0/sls_vocms001/scripts/cmst0_replay_monitoring.sh > /dev/null"
*/5 * * * * lxplus ssh vocms047 "/data/tier0/sls/scripts/cmst0_replay_monitoring.sh > /dev/null"
*/5 * * * * lxplus ssh vocms015 "/data/tier0/sls_vocms015/scripts/cmst0_replay_monitoring.sh > /dev/null"
*/10 * * * * lxplus ssh vocms0313 "/data/tier0/jocasall/croneDiagnosis.sh > /dev/null"
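Each acrontab entry above consists of five cron time fields, then the host on which acron runs the command, then the command itself. A minimal sketch of splitting such a line, for instance to list which entries run on which host; `AcronEntry` and `parse_acron_line` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class AcronEntry(NamedTuple):
    schedule: str  # the five cron time fields, space separated
    host: str      # host on which acron runs the command
    command: str   # everything after the host field


def parse_acron_line(line: str) -> Optional[AcronEntry]:
    """Split one acrontab line; return None for comments and blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Six splits give seven parts: 5 time fields, host, rest of the line.
    parts = stripped.split(None, 6)
    if len(parts) < 7:
        raise ValueError(f"not an acrontab entry: {line!r}")
    return AcronEntry(" ".join(parts[:5]), parts[5], parts[6])


entry = parse_acron_line(
    '*/5 * * * * lxplus ssh vocms015 "/data/tier0/sls/scripts/runSLSAlarms.sh > /dev/null"'
)
print(entry.host, "->", entry.command)
```

Note that in the entries above the acron host is lxplus, and the node that actually runs the monitoring script is the one named in the ssh command.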
Replays Monitoring Overview
There is one monitoring page per replay instance in the Tier0. The URL format is:
https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::tier0_<node_number>
Six different plots can be found on these pages.
Using Lemon
Command to check a value on the Lemon server
lemon-cli --metric="13107 13108" --nodes="CMST0-paused-jobs CMST0-wma-backlog CMST0-check-eos CMST0-late-workflows CMST0-long-jobs" --server
Check the consistency of the XML files
xmllint --noout --schema http://itmon.web.cern.ch/itmon/files/xsls_schema.xsd /path/to/your.xml
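Where xmllint is not at hand, a weaker check is whether the content is well-formed XML at all. The sketch below does only that, using Python's standard library. It does not validate against xsls_schema.xsd, so it complements the command above rather than replacing it. The function name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML (no schema validation)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b>1</b></a>"))  # properly nested tags
print(is_well_formed("<a><b>1</a>"))      # mismatched closing tag
```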
Check the Lemon log (on vocms047, where the XML files are being placed)
tail /var/log/lemon-agent.log -f
Monitoring of the Tier0
As part of the Tier0 monitoring, there are several sites and systems that we usually check.
Current configuration
(last updated: March 2016)
General Overview
Scripts
| Script | Location | OutputFile |
| cmst0-wma-backlog | :/data/tier0/sls/scripts/ | cmst0_backlog_wma.xml |
| cmst0_late_workflows | :/data/tier0/sls/scripts/ | cmst0_late_workflows.xml |
| cmst0_long_jobs | :/data/tier0/sls/scripts/ | cmst0_long_jobs.xml |
| cmst0_paused_jobs | :/data/tier0/sls/scripts/ | cmst0_paused_jobs.xml |
| cmst0_eos_check_v2 | :/afs/cern.ch/user/j/jocasall/public/ | cmst0_eos_check.xml |
| cmst0_eos_check_v2_t0streamer | :/afs/cern.ch/user/j/jocasall/public/ | cmst0_eos_check_t0Streamer.xml |
The monitoring scripts produce XML files, which are then uploaded to ElasticSearch. Kibana queries them from there to produce the plots.
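As a rough illustration of the first step, the sketch below serializes a few numeric metrics into an XML document. It is not the code of the actual scripts. The element and attribute names (`serviceupdate`, `id`, `data`, `numericvalue`, `name`) are assumptions made for this example; the authoritative layout is whatever xsls_schema.xsd requires. The metric values are made up.

```python
import xml.etree.ElementTree as ET


def build_update(service_id: str, metrics: dict) -> str:
    """Serialize numeric metrics into a small XML document.

    Element and attribute names are assumptions, not taken from the schema.
    """
    root = ET.Element("serviceupdate")
    ET.SubElement(root, "id").text = service_id
    data = ET.SubElement(root, "data")
    for name, value in metrics.items():
        node = ET.SubElement(data, "numericvalue", name=name)
        node.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Metric names as listed for cmst0-paused-jobs on this page; values are made up.
xml_text = build_update("CMST0-paused-jobs", {"pausedJobs": 3, "pausedJobsExpress": 1})
print(xml_text)
```

A file produced this way would still need to pass the xmllint schema check described in the Lemon section before being relied upon.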
XML output files
Each script has one or more outputs. The mapping between the output files and the scripts is described in the following table.
cmst0-late-workflows-detailed
- numExpress: Number of Express jobs paused
- numRepack: Number of Repack jobs paused
- numPromptReco: Number of PromptReco jobs paused
cmst0-long-jobs
- longJobs:
- longJobsExpress:
- longJobsRepack:
- longJobsPromptReco:
- longJobsLowThreshold:
- longJobsMidThreshold:
- longJobsHighThreshold:
cmst0-paused-jobs
- pausedJobs:
- pausedJobsExpress:
- pausedJobsRepack:
- pausedJobsPromptReco:
cmst0-check-eos-t0Streamer
- t0StreamerTotal:
- t0StreamerUsedSpace:
- t0StreamerAvailableSpace:
- t0StreamerPercentage:
cmst0-check-eos
- unmergedTotal:
- unmergedUsedSpace:
- unmergedAvailableSpace:
- unmergedPercentage:
cmst0-wma-backlog
- promptrecoCount:
- repackCount:
- expressCount:
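The lists above can be turned into a simple lookup for sanity checks, for example to report which expected metrics are absent from an output. This is a sketch only: the `EXPECTED_METRICS` table copies the names listed above, while the `missing_metrics` helper, and however the found names are extracted from the XML, are hypothetical.

```python
# Metric names copied from the lists above, keyed by output name.
EXPECTED_METRICS = {
    "cmst0-late-workflows-detailed": ["numExpress", "numRepack", "numPromptReco"],
    "cmst0-long-jobs": [
        "longJobs", "longJobsExpress", "longJobsRepack", "longJobsPromptReco",
        "longJobsLowThreshold", "longJobsMidThreshold", "longJobsHighThreshold",
    ],
    "cmst0-paused-jobs": [
        "pausedJobs", "pausedJobsExpress", "pausedJobsRepack", "pausedJobsPromptReco",
    ],
    "cmst0-check-eos-t0Streamer": [
        "t0StreamerTotal", "t0StreamerUsedSpace",
        "t0StreamerAvailableSpace", "t0StreamerPercentage",
    ],
    "cmst0-check-eos": [
        "unmergedTotal", "unmergedUsedSpace",
        "unmergedAvailableSpace", "unmergedPercentage",
    ],
    "cmst0-wma-backlog": ["promptrecoCount", "repackCount", "expressCount"],
}


def missing_metrics(output_name, found_names):
    """Return the expected metrics of output_name that are not in found_names."""
    found = set(found_names)
    return [m for m in EXPECTED_METRICS[output_name] if m not in found]


print(missing_metrics("cmst0-paused-jobs", ["pausedJobs", "pausedJobsRepack"]))
```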
-- JohnHarveyCasallasLeon - 2015-04-21