The aim of this tutorial is to help experiment-specific service managers monitor the critical WLCG services of their experiment.

The machines installed in the CERN Computing Centre are managed via Quattor and monitored via LeMon (LHC Era Monitoring). Quattor (http://www.quattor.org/) is a system administration toolkit that provides a set of tools for the automated installation, configuration and management of clusters and farms. LeMon (http://lemonweb.cern.ch/lemon-web/) is a client/server monitoring system. On every monitored node, a monitoring agent launches and communicates with sensors, which are responsible for retrieving monitoring information. The extracted samples are stored in a local cache and forwarded to a central Measurement Repository.

Some LeMon definitions

Terminology useful for this tutorial; reading is highly recommended!

Agent: the client part of the LeMon monitoring system. It is responsible for communicating with the server and for retrieving information from the sensors. There should be only one agent per machine.

Sensors: separate processes, connected via a bi-directional pipe to the agent running on a LeMon-monitored machine. They live for the duration of the agent's lifetime. There are typically many of them on one machine, and each sensor implements one or more metric classes. Sensors can be (and are) written in different languages (Perl, Python, C++). In principle, a single sensor could implement all available metric classes, but in reality different groups have implemented different sensors in different languages.

Metric class: the base for metrics. It provides a way of monitoring a specific "service". For example, one metric class retrieves information about a file (its creation time, size, etc.), while another counts the number of lines containing a specific string.

Metric: a tool for checking the state of a certain service with a set of configuration options. It is an instance of a metric class and defines some of that class's parameters (e.g. the name of the file that should be checked). The benefit of this approach is that the same piece of code (the metric class) can be instantiated many times on a single machine. On the machine, metrics are defined in the Lemon agent configuration (/etc/lemon/agent/metrics/), but this configuration must not be modified by hand on a Quattor-managed machine. Every metric has its own id, which must be unique within one machine; that is, you cannot have metrics with the same id and a different correlation field on the same machine.

Exception: a special metric, with a unique id, that launches some action (the actuator) based on the information reported by one or more metrics. Which metrics are checked, and how, is specified in a field called correlation.

Alarm: depending on some parameters (silent, local and active), an exception can become an alarm. If we want an exception to trigger an alarm, the fields "local" and "silent" should be "false" and "active" should be "true". Alarms are then (once enabled by the Lemon support people) delivered to the operators in the CERN Computing Centre, making them aware that something is wrong. The operators then try to solve the problem according to the operator procedures (defined in the Operational Procedures Management portal). If the cause of the alarm is not resolved, an ITCM ticket is generated.

On the machine

Specific Tests

A specific test suite has been created for checking three critical services. It contains three independent tests, each of which checks the status of one of these services: alice-box-proxyrenewal, the registration of the node in the myproxy server, and gsisshd.

The output of each test is parsed to collect the status of the service. Each test writes its own log file, independent of the others, in /var/log/voboxTest (vobox-test-1.result, vobox-test-2.result and vobox-test-3.result), where its result is printed. To make the files easier to read and consistent with most log files, each line has the following structure:

 Timestamp (YY-MM-DD TT) 1; test_name = output 
where “output” is 0 if the test is OK and 1 if it fails.
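As a rough illustration of this layout, the following Python sketch builds one result line and reads it back. It is not part of the actual test suite: the helper names `format_result` and `parse_result` are hypothetical, and the exact timestamp format and separators are assumptions read off the structure shown above.

```python
from datetime import datetime


def format_result(test_name: str, passed: bool, when: datetime) -> str:
    """Build one result line: '<timestamp> 1; <test_name> = <0|1>'."""
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")  # assumed timestamp format
    return f"{stamp} 1; {test_name} = {0 if passed else 1}"


def parse_result(line: str) -> tuple[str, int]:
    """Return (test_name, output) from a line built by format_result."""
    _, _, payload = line.partition("; ")
    name, _, value = payload.partition(" = ")
    return name.strip(), int(value)


line = format_result("gsisshd", passed=True, when=datetime(2010, 2, 18, 7, 10, 1))
print(line)
print(parse_result(line))
```

Whatever the real separators are, the string searched for by the Lemon metric must match what the test script writes character for character.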

Cron Job (if needed)

This step is not necessary in order to use Lemon to monitor a service. In our particular case, however, we needed a log for each daemon and none existed yet, so we had to write a script that checks the service and writes its status to a log. Since we want continuous monitoring of the services inside the VOBOXes, the script has to run regularly via crontab. We have to specify how often the script is executed, its location, and where to redirect its output (in this case, to a log file called lcg_cern_vobox_tests.log).

         [root@voalice06 ~]# crontab -e
This opens a text file, to which we have to add a line similar to this one:
         Minutes                  Location
         3,13,23,33,43,53 * * * * /root/lcg_cern_vobox_tests >> /var/log/voboxTest/lcg_cern_vobox_tests.log 2>&1
which means that the script lcg_cern_vobox_tests, located in /root, will run six times per hour, i.e. every ten minutes. After modifying the file, check that it was saved correctly with the following command:
         [root@voalice06 ~]# crontab -l
         3,13,23,33,43,53 * * * * /root/lcg_cern_vobox_tests >> /var/log/voboxTest/lcg_cern_vobox_tests.log 2>&1
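To double-check the schedule, a small Python sketch can expand the minute field and show the spacing between runs. It only handles a plain comma-separated minute list like the one used here (no ranges or step values), and the function names are ours, not part of cron.

```python
def expand_minutes(field: str) -> list[int]:
    """Expand a comma-separated cron minute field into sorted minutes."""
    minutes = sorted(int(part) for part in field.split(","))
    if any(m < 0 or m > 59 for m in minutes):
        raise ValueError("minute values must be in 0..59")
    return minutes


def gaps(minutes: list[int]) -> list[int]:
    """Minutes between consecutive runs, wrapping around the hour."""
    following = minutes[1:] + minutes[:1]
    return [(b - a) % 60 for a, b in zip(minutes, following)]


runs = expand_minutes("3,13,23,33,43,53")
print(len(runs), "runs per hour at minutes", runs)
print("gaps between runs:", gaps(runs))
```

The ten-minute spacing matters later on: the metric only accepts log entries that are recent enough, so the cron period has to stay well below that limit.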

Quattor template

Modifications of the VOBOX templates

The VOBOXes are monitored at the T0 via the Lemon infrastructure. Specific CDB templates for each VOBOX define the characteristics of the node, including access, privileges, criticality, etc.

The first step is to create or modify the template in which you want to include the metrics.

ssh to lxvoadm
Once logged in to lxvoadm, we can access the CDB Quattor server with the cdbop command:
                       [lxvoadm01] /afs/cern.ch/user/l/lolass > cdbop
                       quattor CDB CLI: Version 2.2.0
                       Enter user-name (lolass): 
                       Enter password: 
                       Connecting to https://cdbserv...
                       Welcome to CDB Command Line Interface
                       Opening session...
                       [INFO] session opened with ID <hFlKhqQYK8>
                       Type 'help' for more info
                       <cdbop@cdbserv: ~>
Get the directory where you want to place the template that will include the metrics (in this particular case prod/customization/alice/) and edit the template. We can do this with the get and vi commands.
                       <cdbop@cdbserv: ~> get prod/customization/alice/
                      [INFO] 'prod/customization/alice/': rebuilding local empty directory
                      <cdbop@cdbserv: ~> !vi prod/customization/alice/pro_params_voalice_acl.tpl
If the template is new, we also have to use the add command in order to add it to the database:
                      <cdbop@cdbserv: ~> add prod/customization/alice/pro_params_voalice_acl.tpl
                      [INFO] '/prod/customization/alice/pro_params_voalice_acl': scheduled to be added
If there is any problem with access or permissions contact: vobox.support@cernNOSPAMPLEASE.ch

Create the metrics

First of all, you should check whether a similar metric has already been defined by somebody else. For example, if you want to monitor that the gridftpd daemon is running, you will discover that there are already metrics for it. To check, do the following:
                       Connect to lxvoadm machine, and also connect to the cdbserv (see above), then:
                       <cdbop@cdbserv: ~> get -f /prod/pro_monitoring_*

This will download all ~600 already defined monitoring templates into the prod directory on the local machine. These can then be grepped for keywords ("gridftp", for example) to discover whether (and how) somebody has done something similar. In our example, we discovered that the template pro_monitoring_metrics_gridftp.tpl already defines the metric we wanted (with id 34).
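The keyword search can be done with grep, or with a few lines of Python as sketched here. The function name `find_templates` is ours; the sketch assumes the templates were downloaded as .tpl files under a local prod directory.

```python
from pathlib import Path


def find_templates(root: str, keyword: str) -> list[str]:
    """Return the names of .tpl files under root whose text contains keyword."""
    hits = []
    for path in sorted(Path(root).rglob("*.tpl")):
        text = path.read_text(errors="replace")
        if keyword.lower() in text.lower():
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    # Assumes the templates were fetched into ./prod as described above.
    for name in find_templates("prod", "gridftp"):
        print(name)
```

The match is case-insensitive on the file content, so it also finds templates whose file name does not mention the keyword.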

To further check if the metric is enabled on a given machine, run this command on it:

ncm-query --pan --dump /system/monitoring/metric

If no similar metric is found:

Now we want to edit the template in order to add the three new metrics. First, we have to ask for three metric ids (lemon.support@cernNOSPAMPLEASE.ch) or reuse existing ones of the same class (log.Parse), which you can find here: http://lemonweb.cern.ch/lemon-web/metric_list.php. Only reuse metric ids when you are completely sure about what you are doing and about all the consequences it could have. Also make sure that the same id is not already used on the machine (you should ALWAYS check this). These three ids were provided for the specific Alice tests:

  • 4034 : gsisshd_daemon_result
  • 4036 : VOBOX_Alice_Proxy Renewal
  • 4037 : VOBOX-Proxy-Server-Registration

Once you have picked the ids (4034, 4036, 4037), you must define some parameters of each metric:

  • name: the name of the metric
  • descr: short description
  • class: name of the metric class
  • param: any parameters that need to be passed to the metric class
  • period: time period for measuring the metric
  • active: whether the metric is active or not
  • latestonly: whether only the latest value is kept. If you do not want to keep the history, say true here.
  • smooth: sends the metric output only when it changes, which saves some bandwidth. Consult http://lemon.web.cern.ch/lemon/doc/howto/lemon_cdb_howto.shtml for more info.
  • local: whether the metric should be local only or whether the data should be reported to the server

Example of the metric:

                      "/system/monitoring/metric/_4036" = nlist(
                         "name", "VOBOX_Alice_Proxy Renewal",
                         "descr", "Check Proxy Renewal test's output",
                         "class", "log.Parse",
                         "param", list("logfile", "/var/log/voboxTest/vobox-test-1.result",
                                       "istring", "VOBOX-Proxy-Renewal_result=0",
                                       "dformat", "%Y-%m-%d %T",
                                       "sincelast", "30m"),
                         "period", 60,
                         "smooth", nlist("typeString", false, "maxdiff", 0.0, "maxtime", 600),
                         "active", true,
                         "latestonly", false);
So, in this particular case, the metric called VOBOX_Alice_Proxy Renewal looks for the string “VOBOX-Proxy-Renewal_result=0” in the log file /var/log/voboxTest/vobox-test-1.result, using the timestamp format that we defined for the log file. To make sure that the log file is still being updated (the script or the cron job might stop working, for example), a missing string is not the only condition that counts: if the last update is older than 30 minutes, the alarm will be raised as well. See the definition of the corresponding exception below.
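To make the intended behaviour concrete, here is a small Python sketch of the check described above: it counts log lines that contain the search string and whose timestamp is not older than 30 minutes. It only mimics the idea and is not the code of the log.Parse sensor; the function name `count_recent_matches` and the assumption that every line starts with a `%Y-%m-%d %H:%M:%S` timestamp are ours.

```python
from datetime import datetime, timedelta


def count_recent_matches(lines, needle, now, max_age=timedelta(minutes=30)):
    """Count lines containing needle whose leading timestamp is recent enough."""
    count = 0
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable leading timestamp
        if needle in line and now - stamp <= max_age:
            count += 1
    return count


now = datetime(2010, 2, 18, 7, 30, 0)
log = [
    "2010-02-18 06:43:01 VOBOX-Proxy-Renewal_result=0",  # OK, but too old
    "2010-02-18 07:23:01 VOBOX-Proxy-Renewal_result=0",  # OK and recent
    "2010-02-18 07:13:01 VOBOX-Proxy-Renewal_result=1",  # recent, but failed
]
print(count_recent_matches(log, "VOBOX-Proxy-Renewal_result=0", now))
```

A count of zero covers both failure modes at once: either the test reported a failure, or nothing has been written to the log recently.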

More information about the parameters of the log.Parse sensors can be found in : http://lemon.web.cern.ch/lemon/doc/sensors/parse-log.shtml

Create the Exception.

Still editing the template, we have to create an exception, which launches recovery actions (aka actuators) in response to detected problems. The metrics to be checked are specified in a field called correlation. You may want to check the terminology at the top of the page.

First, we must ask lemon.support@cernNOSPAMPLEASE.ch for an exception id. Then we write the exception itself. Some important parameters to define are:

  • name: name of the exception
  • descr: short description
  • active: whether exception should be active or not
  • latestonly: keep the history of exception values
  • importance: what is the alarm's importance?
    • 0 - informative
    • 1 - low - 9/5 support
    • 2 - high - 24/24 support
  • alarmtext: alarm text that operators would see on their screens
  • minoccurs: how many times the correlation has to evaluate to true before the actuator is run (beware that this point is explained incorrectly in the Lemon documentation)
  • correlation: when the exception should trigger. In our case it checks whether the output of the tests is zero or not; if it is not zero, the actuator is run.
  • actuator: what to do when the exception is triggered (if active mode is specified: "active", true). If any of the checks in the correlation fails, an exception is raised. If the actuator fails (it returns a non-zero value, or the exception still persists after maxruns executions), the alarm is raised.

More info: http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml. Be aware that some terms are used loosely there, which can lead to some confusion.

In our particular case we have exception id 30659, which will raise an exception (and potentially an alarm) if:

  • the cron job is not working
  • the script is not running correctly
  • any of the files is missing

Here is the example:

                      "/system/monitoring/exception/_30659" = nlist(
                              "name", "alice_daemons",
                              "descr", "Alice Daemon",
                              "active", true,
                              "latestonly", false,
                              "importance", 1,
                              "alarmtext", "alice_daemons",
                              "local", false,
                              "silent", false,
                              "correlation", "((4036:1 <= 0) || (4037:1 <= 0) || (4034:1 <= 0))",
                              "actuator", nlist("execve", "/bin/sh -c \\\\\" /bin/echo 'Lemon Sensor'| /bin/mail -s \\\\\\\"my VOBOX is not working properly\\\\\\\" lolass@cern.ch \\\\\" ",
                                            "maxruns", 1,
                                            "timeout", 1200,
                                            "active", true));
As we can see in the actuator, if any of the metrics fails, an exception is raised (the exception is active) and, as a temporary measure, an email is sent to lolass@cernNOSPAMPLEASE.ch. An alarm is also raised, because the actuator does not solve the problem and both the local and the silent parameters are “false”. When an alarm is raised (and if it has been enabled by the Lemon people), the operator sees a web page with the id and name of the exception, the affected vobox and a link to a page with the procedure to follow.
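The logic of the correlation and of the alarm flags can be spelled out in a few lines of Python. This is only a sketch of the boolean conditions as described in this tutorial, not of how Lemon evaluates them internally; the function names are ours, and we assume that field 1 of each metric is a number that is positive while the corresponding test reports success.

```python
# Metric ids watched by the example exception (taken from its correlation field).
WATCHED = (4036, 4037, 4034)


def correlation_is_true(values: dict[int, float]) -> bool:
    """Mirror ((4036:1 <= 0) || (4037:1 <= 0) || (4034:1 <= 0))."""
    return any(values[metric_id] <= 0 for metric_id in WATCHED)


def can_become_alarm(active: bool, local: bool, silent: bool) -> bool:
    """Flag combination required for an exception to turn into an alarm."""
    return active and not local and not silent


healthy = {4036: 1, 4037: 1, 4034: 1}
broken = {4036: 1, 4037: 0, 4034: 1}
print(correlation_is_true(healthy))
print(correlation_is_true(broken))
print(can_become_alarm(active=True, local=False, silent=False))
```

Note that the flags are necessary but not sufficient: the alarm is only raised once the actuator has failed to clear the exception, and only reaches the operators if the Lemon support people have enabled it.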

Once the template is completely edited, we have to update and commit it :

                      <cdbop@cdbserv: ~> update prod/customization/alice/pro_params_voalice_acl.tpl
                      [INFO] '/prod/customization/alice/pro_params_voalice_acl': scheduled to be updated
                      <cdbop@cdbserv: ~> commit
                      [INFO] '/prod/customization/alice/pro_params_voalice_acl': will be updated
                      please confirm [yes]: yes
                      Last comment: adding exception
                      Press [Enter] to confirm the last comment or enter a new one.
                      Comment: adding exception vobox 
                      [INFO] please wait...
                      [INFO] commit OK
We have created/modified the metric template, but now we have to include this template in the machine's profile. So we have to get and edit the profile template of the vobox, as we did with the one with the metrics:
                 <cdbop@cdbserv: ~> get profiles/profile_voalice06.tpl
                 [INFO] 'profiles/profile_voalice06.tpl': received
                 <cdbop@cdbserv: ~> !vi profiles/profile_voalice06.tpl
Here we just have to add a line that includes the template with the metrics we have created:
                      include { 'customization/alice/pro_params_voalice_acl' }; 
Then update and commit, as we did before:
                      <cdbop@cdbserv: ~> update profiles/profile_voalice06.tpl
                      [INFO] '/profiles/profile_voalice06': scheduled to be updated
                      <cdbop@cdbserv: ~> commit
                      [INFO] '/profiles/profile_voalice06': will be updated
                      please confirm [yes]: yes
                      Last comment: 
                      Comment: adding metrics template to profile
                      [INFO] please wait...   
                      [INFO] commit OK
Once the templates are committed (it can take some time), we have to apply the changes in the vobox itself with the following command:
                      [root@voalice06 ~]# ccm-fetch; ncm-ncd --co fmonagent

Modify/Overwrite an existing metric/exception

Via Quattor you can overwrite some exception parameters in the customization template.

HowTo Check quattor and LeMon information

Check the configuration running on the machine

To check if the changes have been really applied to the machine, we can run
ncm-query --dump /system
and look for the part we have changed.

Check the status of the metrics and exceptions

To check the metrics that are running on the machine, we can look at their logs. The log directory contains the logs of all the metrics and exceptions running on the vobox, with the following name pattern and format: YY_MM_DD_MetricId
                      DATE                          ExcpState               Code          Info
                      1265933701                      0                       000           (null)
                      1265933761                      0                       000           (null)
                      1265933821                      1                       140           voatlas62:20002:1[48.48]_>_32
                      1265933881                      1                       140           voatlas62:20002:1[101.20]_>_32
                      1265933941                      1                       140           voatlas62:20002:1[135.00]_>_32
                      1265934001                      1                       140           voatlas62:20002:1[159.63]_>_32
                      1265934061                      1                       140           voatlas62:20002:1[167.68]_>_32
                      1265934121                      1                       140           voatlas62:20002:1[180.18]_>_32
                      1265934181                      1                       140           voatlas62:20002:1[185.72]_>_32
                      1265934241                      1                       140           voatlas62:20002:1[185.77]_>_32
                      1265934301                      1                       140           voatlas62:20002:1[179.32]_>_32
                      1265934361                      1                       140           voatlas62:20002:1[163.20]_>_32
                      1265934421                      1                       105           voatlas62:20002:1[146.65]_>_32
                      1265934421                      1                       110           voatlas62:20002:1[146.65]_>_32
                      1265934424                      1                       135           voatlas62:20002:1[137.84]_>_32
                      1265934481                      1                       135           voatlas62:20002:1[128.48]_>_32
                      1265934541                      1                       135           voatlas62:20002:1[107.42]_>_32
                      1265934601                      1                       135           voatlas62:20002:1[90.57]_>_32
                      1265934661                      1                       135           voatlas62:20002:1[83.21]_>_32
                      1265934721                      1                       135           voatlas62:20002:1[67.96]_>_32
                      1265934781                      1                       135           voatlas62:20002:1[45.92]_>_32
                      1265934841                      0                       000          (null)
Where the "Exception State" can be 0 = no exception, 1 = exception detected, -1 = error, 2 = disabled exception (not triggering alarms); the code tells which action is being performed, and the info column shows the correlation of the exception. All the possible codes can be found at http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml
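When reading these logs by hand, it helps to turn the first column into a readable date and the state into words. The Python sketch below does this for a single line; it assumes that the first column is a Unix epoch timestamp in seconds, and the names `STATES` and `parse_exception_line` are ours.

```python
from datetime import datetime, timezone

# Meaning of the ExcpState column, as listed above.
STATES = {0: "no exception", 1: "exception detected", -1: "error", 2: "disabled exception"}


def parse_exception_line(line: str) -> dict:
    """Split one log line into time, state, code and info."""
    stamp, state, code, info = line.split(None, 3)
    info = info.strip()
    return {
        "time": datetime.fromtimestamp(int(stamp), tz=timezone.utc),
        "state": STATES.get(int(state), "unknown"),
        "code": code,
        "info": None if info == "(null)" else info,
    }


row = parse_exception_line("1265933821  1  140  voatlas62:20002:1[48.48]_>_32")
print(row["time"].isoformat(), row["state"], row["code"], row["info"])
```

The code column is kept as a string on purpose, so that values such as 000 are not turned into 0.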

Retrieve LeMon metrics

The Lemon metrics are cached locally and then pushed/pulled on the Lemon server. It is possible to retrieve metrics results using the lemon-cli from e.g. lxplus:
lemon-cli -n "atlddm[11,29]" -m "[4031,4032,4033,4034]" --server --script-mode
lemon-cli -n "atlddm29" -m "[4032-4036,5210]" --server --script-mode --start 2010:02:18T07:10:01
lemon-cli -n "atlddm29,voatlas62" -m "[20002,30008]" --server --script-mode --start 2010:02:18T03:10:01
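If such queries are issued from a script, the command line can be assembled programmatically. The Python sketch below only builds the argument list from the options shown in the examples above and does not run anything; the helper name `lemon_cli_args` is ours, and no lemon-cli options beyond those shown here are assumed.

```python
import shlex
from typing import Optional


def lemon_cli_args(nodes: str, metrics: str, start: Optional[str] = None) -> list[str]:
    """Build a lemon-cli argument list using only the options shown above."""
    args = ["lemon-cli", "-n", nodes, "-m", metrics, "--server", "--script-mode"]
    if start is not None:
        args += ["--start", start]
    return args


cmd = lemon_cli_args("atlddm29", "[4032-4036,5210]", start="2010:02:18T07:10:01")
print(shlex.join(cmd))
```

The resulting list can be passed as is to subprocess.run on a machine where lemon-cli is installed, which avoids any shell quoting problems with the square brackets.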

Alarms for the CERN Computing Centre

The alarms in the CERN Computing Centre are handled by operators, who are generally not IT experts. Therefore, exact procedures for what must be done when an alarm occurs must be provided. These procedures are managed in the Operational Procedures Management portal: http://service-cc-opm.web.cern.ch/service-cc-opm/cgi-bin/opm_overview.cgi. The workflow for defining a procedure is described in more detail on the portal itself.

Tip: If you login and the portal still says that you are anonymous, try to reload the page (this is needed in Opera and Firefox).

LeMon alarms reaching the IT Operators

  • all the CC alarms
  • LeMon operator alarms page with link to Ops Procedures

Check existing procedures

Use the search button on the procedures portal. This searches not only in the names of .html files in which procedures are described, but also in the content of these files.

Write a new procedure

Please consult the manual on the portal and the procedures already in place. Also note that the exact alarm name must be given in the procedure description to ensure that it is connected with the corresponding alarm in LAS.

The Service Level Status for the Experiment Services

SLS Examples: ADC Central Services, LHCb Storage Space, CMS PhEDEx Service, ADC DDM VOBoxes

HowTo publish service level information into the SLS framework

declare a service in the service DB


the xml format

minimal xml update

the VO Specific Service Monitoring pckg

A tool to integrate LeMon metrics results, plus some other more specific service information, into SLS: VOSpecificServicesMon

Templates with common metrics

| Template | Metric | Info |
| http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/pro_monitoring_metrics_wm | 5240, 5241, 5242, 5243, 5244, 5245, 5246, 5247, 5248, 5249, 5251, 5252, 5253, 5254 | http://lemonweb.cern.ch/lemon-web/metric_list.php |
| http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/pro_monitoring_metrics_system_state | 5121, 5004, 20047, 20044, 5122, 20002, 20003, 5003, 5010, 5005, 5006, 5008, 5101, 5013 | http://lemonweb.cern.ch/lemon-web/metric_list.php |
| http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/pro_monitoring_metrics_gridftp | 27, 30, 34 | http://lemonweb.cern.ch/lemon-web/metric_list.php |
| http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/pro_monitoring_metrics_apache | 4019, 4021 | http://lemonweb.cern.ch/lemon-web/metric_list.php |
| http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=prod/pro_monitoring_metrics_gridcert | 807, 810 | http://lemonweb.cern.ch/lemon-web/metric_list.php |


CDBop Commands

Some commands that can be useful using CDBop:

  • !vi: to create or edit a new template
        <cdbop@cdbserv: ~> !vi prod/customization/alice/pro_params_voalice_acl.tpl
  • add: to add a new template to the database (first must have been created with vi)
        <cdbop@cdbserv: ~> add prod/customization/alice/pro_params_voalice_acl.tpl
  • get: if the template already exists, you need to "get" it in order to modify it. You can get the complete directory where the template is stored or just the template itself
        <cdbop@cdbserv: ~> get prod/customization/alice/
  • update: in order to commit a template, you have to update it first.
         <cdbop@cdbserv: ~> update prod/customization/alice/pro_params_voalice_acl.tpl
  • commit: commit the changes to CDB. The "comment" field is mandatory in order to complete the operation
         <cdbop@cdbserv: ~> commit

Shell commands can be run by putting an exclamation mark before the command:

        <cdbop@cdbserv: ~> !vi prod/customization/alice/pro_params_voalice_acl.tpl

-- AleDiGGi - 16-Feb-2010 -- JiriHorky - 04-Mar-2010 -- JiriHorky - 18-Feb-2010

Topic revision: r20 - 2010-03-17 - SaizSantosLola