-- Main.dimou - 30 May 2006

VOM(R)S Service Monitoring tools on the CERN VOMS servers

| Tool | voms101/4 alias voms.cern.ch * | voms105 (normally lcg-voms.cern.ch) | voms106 (normally voms-slave.cern.ch) | More info |
| voms-ping | Yes | Yes | Yes | Uses voms-admin list commands to detect voms-admin problems and voms-proxy-init to detect voms core problems. Generates operator alarms. See VomsWlcgHa and VomsPingScript. NB: it is incomplete. See bug 19770 |
| LinuxHA takeover | Not needed | Yes | Yes | An ITCM ticket is created and an email is sent to VOMS.Support@cernNOSPAMPLEASE.ch,Tim.Bell@cern.ch,Harry.Renshall@cern.ch whenever a switch occurs. There are two mails, one from the machine giving up the role of master and another from the machine taking over the role. |
| TOMCAT_WRONG | Yes | Yes | Yes | Checks that there is at least one java process run by the user tomcat4. Raises an operator alarm if tomcat is down. To disable or enable it, type as root on the host: lemon-host-check --disable=30055 or lemon-host-check --enable=30055 |
| vomrs-ping (not used yet, needs to be fixed) | Not needed | Yes | Not appropriate when slave | Checks that all VOMRS servers are up with '/opt/vomrs-1.3/etc/init.d/vomrs status', and also that the WS interface (and so the applet in tomcat) of each VO is up, by doing a request like '/opt/vomrs-1.3/client/bin/vomrs_soapclient lcg-voms.cern.ch 8443 vo/dteam GetGroups'. It also parses /var/log/vomrs/vomrs_<vo>.log to detect whether all threads are up. When it detects an error, it writes a line into /var/log/vomrs/vomrs-ping.alarm, and Lemon raises an alarm. The script must be run in the background when a node becomes the master. See bug 19774 |
| Manual checks | Yes | Yes | Yes | All commands in VomsStartStopCheck can be typed on the hosts by anyone with root privileges |
| mkgridmap-check | Not yet | Not yet | Not appropriate when slave | Raises an alarm when VO members can't be listed for gridmap file re-generation. See bug 19766 |
| InconsistentDatabase check | Yes | Yes | Yes | A Lemon sensor parses the voms-admin logs to look for inconsistencies in the voms DB. An alarm is raised when any are found. See the procedure for operators. |
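
As an illustration of the TOMCAT_WRONG row, the check "at least one java process run by the user tomcat4" could be sketched as below. This is not the actual Lemon sensor. The function name has_user_process and the sample text are hypothetical, and the input is assumed to be the output of `ps -eo user,comm`.

```python
def has_user_process(ps_output, user="tomcat4", command="java"):
    """Return True if ps_output lists at least one `command` process owned by `user`.

    ps_output is assumed to be the text printed by `ps -eo user,comm`:
    one header line, then one "USER COMMAND" pair per line.
    """
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] == user and fields[1].strip() == command:
            return True
    return False


# Hypothetical sample input, for illustration only.
sample = """USER     COMMAND
root     init
tomcat4  java
root     sshd
"""

print(has_user_process(sample))
```

On a real host the text could be obtained with subprocess.run(["ps", "-eo", "user,comm"], capture_output=True, text=True).stdout.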

Lemon howto for VOMS monitoring

For some tasks, using Lemon avoids having to implement tricky monitors oneself. For example, Lemon comes with a sensor for parsing log files, which is easier to use than writing a script that does the text processing. See the official Lemon documentation.

Configuration of new metrics

A sensor is a process that implements several metric classes. The documentation of the sensor should say which parameters a metric class accepts (see, for example, the docs of the Linux sensor). The data is sampled by a metric, which is an instance of a metric class with actual values passed as parameters. Metrics are defined in the Lemon agent config files (/etc/lemon/agent/metrics/*), but these files should not be modified by hand on Quattor-managed hosts.

In order to configure and deploy a metric on a Quattor-managed host, the following documents are relevant:

[TODO: detailed procedure]

Using active metrics and alarms

There are two command-line programs, to be run as root on the VOMS servers. For example, a script that needs to obtain the current value of metrics 5220 to 5224 can run:
[root@voms103 root]# lemon-cli -m '5220 5221 5222 5223 5224'
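
A script could wrap the call above as in the following minimal sketch. The helper names lemon_cli_command and read_metrics are hypothetical, only the -m option shown above is used, and the output of lemon-cli is returned unparsed because its format is not described here.

```python
import subprocess


def lemon_cli_command(metric_ids):
    """Build the argument list for: lemon-cli -m '<id> <id> ...'."""
    return ["lemon-cli", "-m", " ".join(str(m) for m in metric_ids)]


def read_metrics(metric_ids):
    """Run lemon-cli (as root on the VOMS server) and return its raw output."""
    result = subprocess.run(
        lemon_cli_command(metric_ids),
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command line; call read_metrics() on a host with lemon-cli.
    print(lemon_cli_command(range(5220, 5225)))
```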

Lemon Alarm investigation (alarm written by R. Bonvallet, text by V.Lefebure)

To investigate alarm "N" on host "H", proceed as follows.

As an example, let's take:
* alarm name = "voms-admin_inconsistent_database_exception"
* Host = "voms103"

 1) Go to the Lemon host page for the host.
 2) From there, go to "LAS Alarm history".
 3) Click on the alarm ID corresponding to the alarm (here voms-admin_inconsistent_database_exception).
 4) In the "History of alarm value" you find why the alarm was triggered:
voms103:5220:1[0] > 0 || voms103:5221:1[0] > 0 || voms103:5222:1[1] > 0
|| voms103:5223:1[0] > 0 || voms103:5224:1[0] > 0
 5) Now you want to know what metrics 5220 to 5224 are. At least one of them
 has a value > 0, which is why the alarm was raised.
Go to "Metrics" (at the top of the page).
 6) Look for 5220 etc. You find (metric name, ID, class, flag, description):
voms-admin_alice_inconsistent_database  5220  log.Parse  Y  Count of inconsistency messages in alice voms-admin log
voms-admin_atlas_inconsistent_database  5221  log.Parse  Y  Count of inconsistency messages in atlas voms-admin log
voms-admin_cms_inconsistent_database    5222  log.Parse  Y  Count of inconsistency messages in cms voms-admin log
voms-admin_dteam_inconsistent_database  5223  log.Parse  Y  Count of inconsistency messages in dteam voms-admin log
voms-admin_lhcb_inconsistent_database   5224  log.Parse  Y  Count of inconsistency messages in lhcb voms-admin log

 7) From steps 4 and 6, you can infer that the alarm was triggered because
 there was an error message in the CMS log file.

8) on voms103, you can run "ncm-query --dump /system/monitoring | less",
and look for 5222 (which had value 1 instead of 0).
You find:
      $ active : (boolean) 'true'
      $ class : (string) 'log.Parse'
      $ descr : (string) 'Count of inconsistency messages in cms voms-admin log'
      $ latestonly : (boolean) 'true'
      $ name : (string) 'voms-admin_cms_inconsistent_database'
        $ 0 : (string) 'logfile'
        $ 1 : (string) '/var/log/tomcat5/voms-admin.cms.log'
        $ 2 : (string) 'istring'
        $ 3 : (string) 'Internal database inconsistency'
        $ 4 : (string) 'estring'
        $ 5 : (string) '^[^\d]'
        $ 6 : (string) 'dformat'
        $ 7 : (string) '%F %T'
        $ 8 : (string) 'sincelast'
        $ 9 : (string) '15m'
      $ period : (long) '600'

---------> You now know that the alarm counts occurrences of the message
 "Internal database inconsistency" in '/var/log/tomcat5/voms-admin.cms.log'.
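
The reasoning of steps 4 to 7 can be sketched in a few lines of Python. This is only an illustration, under the assumption that the number in square brackets is the sampled value of the metric (consistent with 5222 having value 1 instead of 0). The names TERM and fired_metrics are hypothetical. The alarm expression and the metric names are copied from steps 4 and 6.

```python
import re

# Alarm condition copied from step 4.
ALARM = (
    "voms103:5220:1[0] > 0 || voms103:5221:1[0] > 0 || voms103:5222:1[1] > 0 "
    "|| voms103:5223:1[0] > 0 || voms103:5224:1[0] > 0"
)

# Metric ID -> metric name, copied from step 6.
METRIC_NAMES = {
    5220: "voms-admin_alice_inconsistent_database",
    5221: "voms-admin_atlas_inconsistent_database",
    5222: "voms-admin_cms_inconsistent_database",
    5223: "voms-admin_dteam_inconsistent_database",
    5224: "voms-admin_lhcb_inconsistent_database",
}

# One term looks like "host:metric:field[value] > threshold".
# Assumption: the bracketed number is the sampled value.
TERM = re.compile(r"(\w+):(\d+):(\d+)\[(-?\d+)\]\s*>\s*(-?\d+)")


def fired_metrics(expression):
    """Return the IDs of the metrics whose bracketed value exceeds the threshold."""
    fired = []
    for _host, metric, _field, value, threshold in TERM.findall(expression):
        if int(value) > int(threshold):
            fired.append(int(metric))
    return fired


for metric_id in fired_metrics(ALARM):
    print(metric_id, METRIC_NAMES.get(metric_id, "unknown"))
```

For the expression above this prints "5222 voms-admin_cms_inconsistent_database", which points at the CMS log file as in step 7.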

This topic: LCG > VomsServiceMonitor
Topic revision: r22 - 2008-03-27 - SteveTraylen