Tool | voms101 (alias voms.cern.ch) | voms102 (normally lcg-voms.cern.ch) | voms103 (normally voms-slave.cern.ch) | More info |
---|---|---|---|---|
voms-ping | Yes | Yes | Yes | Uses voms-admin list commands to detect voms-admin problems and voms-proxy-init to detect voms core problems. Generates operator alarms. See VomsWlcgHa and VomsPingScript. NB: it is incomplete, see bug 19770. |
LinuxHA takeover | Not needed | Yes | Yes | Whenever a switch occurs, an ITCM ticket is created and an email is sent to VOMS.Support@cernNOSPAMPLEASE.ch, Tim.Bell@cern.ch and Harry.Renshall@cern.ch. Two mails are sent: one from the machine giving up the role of master and one from the machine taking over the role. |
TOMCAT_WRONG | Yes | Yes | Yes | Checks that there is at least one java process run by the user tomcat4. Raises an operator alarm if Tomcat is down. To disable or enable the check, type as root on the host: lemon-host-check --disable=30055 or lemon-host-check --enable=30055 |
vomrs-ping | Not needed | Yes | Not appropriate when slave | Checks that all VOMRS servers are up with ' /opt/vomrs-1.3/etc/init.d/vomrs status ', and also that the WS interface (and so the applet in Tomcat) of each VO is up, by making a request like ' /opt/vomrs-1.3/client/bin/vomrs_soapclient lcg-voms.cern.ch 8443 vo/dteam GetGroups '. It also parses /var/log/vomrs/vomrs_<vo>.log to check that all threads are up. When it detects an error, it writes a line to /var/log/vomrs/vomrs-ping.alarm, and Lemon raises an alarm. The script must be run in the background when a node becomes the master. See bug 19774. |
Manual checks | Yes | Yes | Yes | All commands in VomsStartStopCheck can be typed on the hosts by anyone with root privileges. |
mkgridmap-check | Not yet | Not yet | Not appropriate when slave | Raises an alarm when VO members cannot be listed for gridmap file regeneration. See bug 19766. |
InconsistentDatabase check | Yes | Yes | Yes | A Lemon sensor parses the voms-admin logs to look for inconsistencies in the voms DB. An alarm is raised when any are found. See the procedure for operators. |
voms-check | Not yet | Not needed | Not needed | Uses voms-ping to detect voms-admin problems and sends email (recipient list to be configured). Used only on voms101, as voms-ping is checked by LinuxHA on voms102 and voms103. Pending actions: fix the path in bug 19771. |
voms-maint (obsolete) | Yes | Yes | Yes | Restarts Tomcat from /etc/cron.d/voms-maint (rpm in VomsCernSetup, due to bug 16843). |
Tomcat memory usage (obsolete) | Yes | Yes | Yes | Script (/root/monitor/memory.sh) measures memory usage statistics of the Tomcat process. The idea is to relate memory usage to the outOfMemory error, once the latter happens again. When bug 20800 |
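
The TOMCAT_WRONG row in the table above describes a check that at least one java process is run by the user tomcat4. The actual Lemon sensor is not shown on this page, so the following Python sketch only illustrates the idea under stated assumptions: the function name `java_process_count` is hypothetical, and the input is assumed to be the output of `ps -eo user,comm` (one header line, then one process per line).

```python
def java_process_count(ps_output, user="tomcat4"):
    """Count processes named 'java' owned by `user` in `ps -eo user,comm` output."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 1)         # -> [user, command]
        if len(fields) == 2 and fields[0] == user and fields[1].strip() == "java":
            count += 1
    return count


# Hypothetical sample; on a live host the text could come from
# subprocess.run(["ps", "-eo", "user,comm"], capture_output=True, text=True).stdout
sample_ps = """USER     COMMAND
root     init
tomcat4  java
tomcat4  java
apache   httpd
"""

tomcat_ok = java_process_count(sample_ps) >= 1
print("tomcat up" if tomcat_ok else "TOMCAT_WRONG")
```

A result of zero would correspond to the condition under which the table says an operator alarm is raised.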
[root@voms103 root]# lemon-cli -m '5220 5221 5222 5223 5224'
To investigate alarm name "N" on host "H", proceed as follows. As an example, take:

* alarm name = "voms-admin_inconsistent_database_exception"
* host = "voms103"

1) Go to the LEMON host page for the host: http://lemonweb.cern.ch/lemon-status/info.php?host=voms103

2) From there, go to "LAS Alarm history": http://lemonweb.cern.ch/lemon-status/las_alarms.php?host=voms103

3) Click on the alarm ID corresponding to the alarm called "voms-admin_inconsistent_database_exception": http://lemonweb.cern.ch/lemon-status/las_alarm_detail.php?alarm_id=41038&host=voms103

4) In the "History of alarm value" you find why the alarm was triggered:

    voms103:5220:1[0] > 0 || voms103:5221:1[0] > 0 || voms103:5222:1[1] > 0 || voms103:5223:1[0] > 0 || voms103:5224:1[0] > 0

5) Now you want to know what metrics 5220 to 5224 are. At least one of them has a value > 0, which is why the alarm was raised. Go to "Metrics" (at the top of the page): http://lemonweb.cern.ch/lemon-status/metric_descriptions.php

6) Look for 5220 etc. You find:

    voms-admin_alice_inconsistent_database <metric_info.php?metric=voms-admin_alice_inconsistent_database> 5220 log.Parse <metric_class_info.php?class=log.Parse> Y Count of inconsistency messages in alice voms-admin log
    voms-admin_atlas_inconsistent_database <metric_info.php?metric=voms-admin_atlas_inconsistent_database> 5221 log.Parse <metric_class_info.php?class=log.Parse> Y Count of inconsistency messages in atlas voms-admin log
    voms-admin_cms_inconsistent_database <metric_info.php?metric=voms-admin_cms_inconsistent_database> 5222 log.Parse <metric_class_info.php?class=log.Parse> Y Count of inconsistency messages in cms voms-admin log
    voms-admin_dteam_inconsistent_database <metric_info.php?metric=voms-admin_dteam_inconsistent_database> 5223 log.Parse <metric_class_info.php?class=log.Parse> Y Count of inconsistency messages in dteam voms-admin log
    voms-admin_lhcb_inconsistent_database <metric_info.php?metric=voms-admin_lhcb_inconsistent_database> 5224 log.Parse <metric_class_info.php?class=log.Parse> Y Count of inconsistency messages in lhcb voms-admin log

7) From steps 4 and 6, you can infer that the alarm was triggered because there was an error message in the CMS log file.

8) On voms103, you can run "ncm-query --dump /system/monitoring | less" and look for 5222 (which had value 1 instead of 0). You find:

    +-_5222
      $ active : (boolean) 'true'
      $ class : (string) 'log.Parse'
      $ descr : (string) 'Count of inconsistency messages in cms voms-admin log'
      $ latestonly : (boolean) 'true'
      $ name : (string) 'voms-admin_cms_inconsistent_database'
      +-param
        $ 0 : (string) 'logfile'
        $ 1 : (string) '/var/log/tomcat5/voms-admin.cms.log'
        $ 2 : (string) 'istring'
        $ 3 : (string) 'Internal database inconsistency'
        $ 4 : (string) 'estring'
        $ 5 : (string) '^[^\d]'
        $ 6 : (string) 'dformat'
        $ 7 : (string) '%F %T'
        $ 8 : (string) 'sincelast'
        $ 9 : (string) '15m'
      $ period : (long) '600'

So you know that the error message is "Internal database inconsistency", found in '/var/log/tomcat5/voms-admin.cms.log'.

    at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
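
The reasoning in steps 4 and 8 can be sketched in Python. This is only an illustration, not the Lemon implementation: the function names `firing_metrics` and `count_recent_matches` are hypothetical, the alarm terms are assumed to have the shape host:metric:field[value] > threshold, and each log line is assumed to start with a timestamp in the '%F %T' form (written `%Y-%m-%d %H:%M:%S` in Python). The handling of the 'estring' parameter is not modelled, since its exact meaning is not documented here.

```python
import re
from datetime import datetime, timedelta

# One alarm term, e.g. "voms103:5222:1[1] > 0" (assumed layout, see lead-in).
TERM = re.compile(r"([\w.-]+):(\d+):(\d+)\[(-?\d+)\]\s*>\s*(-?\d+)")

ISTRING = "Internal database inconsistency"  # param 3 in the ncm-query dump
DFORMAT = "%Y-%m-%d %H:%M:%S"                # Python spelling of '%F %T' (param 7)
SINCELAST = timedelta(minutes=15)            # param 9, '15m'


def firing_metrics(expression):
    """Return the metric IDs whose recorded value exceeds the threshold."""
    return [
        int(metric)
        for _host, metric, _field, value, threshold in TERM.findall(expression)
        if int(value) > int(threshold)
    ]


def count_recent_matches(lines, now, istring=ISTRING, window=SINCELAST):
    """Count lines containing `istring` that are stamped within `window` before `now`."""
    count = 0
    for line in lines:
        if istring not in line:
            continue
        try:
            stamp = datetime.strptime(line[:19], DFORMAT)
        except ValueError:
            continue  # no parsable timestamp at the start of the line
        if timedelta(0) <= now - stamp <= window:
            count += 1
    return count


alarm = ("voms103:5220:1[0] > 0 || voms103:5221:1[0] > 0 || voms103:5222:1[1] > 0"
         " || voms103:5223:1[0] > 0 || voms103:5224:1[0] > 0")
fired = firing_metrics(alarm)

# Hypothetical log lines, made up for the example.
sample_log = [
    "2008-01-15 10:01:00 ERROR Internal database inconsistency",
    "2008-01-15 09:00:00 ERROR Internal database inconsistency",
    "2008-01-15 10:05:00 INFO request served",
]
recent = count_recent_matches(sample_log, now=datetime(2008, 1, 15, 10, 10, 0))
print(fired, recent)
```

With the alarm expression from step 4, only metric 5222 is reported, which matches the conclusion in step 7 that the CMS log is the one to inspect.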