VomsServiceMonitor (2008-03-27, SteveTraylen)
-- Main.dimou - 30 May 2006

---++ VOM(R)S Service Monitoring tools on the CERN VOMS servers

| *Tool* | *voms101/4 (alias voms.cern.ch)* | *voms105 (normally lcg-voms.cern.ch)* | *voms106 (normally voms-slave.cern.ch)* | *More info* |
| voms-ping | Yes | Yes | Yes | Uses _voms-admin list_ commands to detect voms-admin problems and _voms-proxy-init_ to detect voms core problems. Generates operator alarms. See VomsWlcgHa and VomsPingScript. NB: it is incomplete. See [[https://savannah.cern.ch/bugs/?func=detailitem&item_id=19770][bug 19770]] |
| LinuxHA takeover | Not needed | Yes | Yes | An ITCM ticket is created and email is sent to VOMS.Support@cern.ch, Tim.Bell@cern.ch, Harry.Renshall@cern.ch whenever a switch occurs. Two mails are sent: one from the machine giving up the role of master and another from the machine taking over the role. |
| TOMCAT_WRONG | Yes | Yes | Yes | Checks that there is at least one java process run by the user tomcat4. Raises an operator alarm if tomcat is down. To disable or enable it, type as root on the host: _lemon-host-check --disable=30055_ or _lemon-host-check --enable=30055_ |
| vomrs-ping (not used yet, needs to be fixed) | Not needed | Yes | Not appropriate when slave | Checks that all VOMRS servers are up with ' _/opt/vomrs-1.3/etc/init.d/vomrs status_ ', and that the WS interface (and so the applet in tomcat) of each VO is up by sending a request like this: ' _/opt/vomrs-1.3/client/bin/vomrs_soapclient lcg-voms.cern.ch 8443 vo/dteam GetGroups_ '. It also parses /var/log/vomrs/vomrs_<vo>.log to check that all threads are up. When it detects an error, it writes a line to /var/log/vomrs/vomrs-ping.alarm, and Lemon raises an alarm.<br />The script must be run in the background when a node becomes the master. See [[https://savannah.cern.ch/bugs/index.php?func=detailitem&item_id=19774][bug 19774]] |
| Manual checks | Yes | Yes | Yes | All commands in VomsStartStopCheck can be typed on the hosts by anyone with root privileges |
| mkgridmap-check | Not yet | Not yet | Not appropriate when slave | Raises an alarm when VO members can't be listed for gridmap file re-generation. See [[https://savannah.cern.ch/bugs/index.php?func=detailitem&item_id=19766][bug 19766]] |
| InconsistentDatabase check | Yes | Yes | Yes | A Lemon sensor parses the voms-admin logs to look for inconsistencies in the voms DB. An alarm is raised when any are found. See the [[http://service-cc-opm.web.cern.ch/service-cc-opm/procedure/OP-PROC-vomsadmin-inconsistent-database-exception.html][procedure for operators]]. |

---++ Lemon howto for VOMS monitoring

For some tasks, using [[http://cern.ch/lemon][Lemon]] avoids having to implement tricky monitors oneself. For example, Lemon comes with a sensor for parsing log files that is easier to use than writing a script that does the text processing. See the [[http://lemon.web.cern.ch/lemon/docs.shtml][official Lemon documentation]].

---+++ Configuration of new metrics

A _sensor_ is a process that implements several _metric classes_. The documentation of the sensor should say which parameters a metric class accepts (see, for example, the docs of the [[http://lemon.web.cern.ch/lemon/doc/sensors/linux.shtml][Linux sensor]]). The data is sampled by a _metric_, which is an instance of a metric class with actual values passed as parameters. Metrics are defined in the Lemon agent config files (/etc/lemon/agent/metrics/*), but these files *should not* be modified by hand on Quattor-managed hosts.
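The relationship between a metric class and a metric can be sketched in a few lines of Python. This is only an illustrative model, not Lemon code: the names =MetricClass=, =Metric= and =unknown_params= are hypothetical, and the parameter names listed for log.Parse are just the ones that appear in the ncm-query dump on this page, not necessarily the complete set the class accepts.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class MetricClass:
    """A kind of measurement implemented by a sensor (hypothetical model)."""
    name: str
    accepted_params: FrozenSet[str]


@dataclass
class Metric:
    """A metric class instantiated with actual parameter values."""
    metric_id: int
    name: str
    metric_class: MetricClass
    params: Dict[str, str] = field(default_factory=dict)

    def unknown_params(self) -> List[str]:
        """Return parameter names the metric class does not declare."""
        return sorted(set(self.params) - self.metric_class.accepted_params)


# Parameter names as seen on this page; assumed, not an exhaustive list.
LOG_PARSE = MetricClass(
    name="log.Parse",
    accepted_params=frozenset(
        {"logfile", "istring", "estring", "dformat", "sincelast"}
    ),
)

# Values copied from the metric 5222 dump shown on this page.
cms_metric = Metric(
    metric_id=5222,
    name="voms-admin_cms_inconsistent_database",
    metric_class=LOG_PARSE,
    params={
        "logfile": "/var/log/tomcat5/voms-admin.cms.log",
        "istring": "Internal database inconsistency",
        "estring": r"^[^\d]",
        "dformat": "%F %T",
        "sincelast": "15m",
    },
)

print(cms_metric.metric_class.name, cms_metric.unknown_params())
```

The point of the split is that the class fixes which parameters exist, while each metric (one per VO log file here) only supplies values for them.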
In order to configure and deploy a metric on a Quattor-managed host, the following documents are relevant:

   * [[http://lemon.web.cern.ch/lemon/doc/howto/lemon_cdb_howto.shtml][Procedure for writing CDB templates for Lemon monitoring]]
   * [[http://lemon.web.cern.ch/lemon/doc/sensor_metric_registration.shtml][Procedure for metric registration]]
   * [[FIOgroup.CDBMonitoringConfiguration][CDB monitoring configuration]]

[TODO: detailed procedure]

---+++ Using active metrics and alarms

There are two command-line programs, to be run as root on the VOMS servers:

   * [[http://lemon.web.cern.ch/lemon/doc/components/lemon-host-check.shtml][lemon-host-check]] shows whether there are active alarms.
   * [[http://lemon.web.cern.ch/lemon/doc/components/lemon-cli.shtml][lemon-cli]] shows the sampled values of the metrics for the host.

For example, a script that needs to obtain the current value of metrics 5220 to 5224 can do as follows:

<verbatim>
[root@voms103 root]# lemon-cli -m '5220 5221 5222 5223 5224'
</verbatim>

---+++ Lemon Alarm investigation (alarm written by R. Bonvallet, text by V.Lefebure)

<verbatim>
To investigate alarm name "N" on host "H", proceed as follows.

Let's take
 * alarm name = "voms-admin_inconsistent_database_exception"
 * Host = "voms103"

1) Go to the LEMON host page for the host:
   http://lemonweb.cern.ch/lemon-status/info.php?host=voms103

2) From there, go to "LAS Alarm history" =
   http://lemonweb.cern.ch/lemon-status/las_alarms.php?host=voms103

3) Click on the alarm ID corresponding to the alarm called "voms-admin_inconsistent_database_exception" =
   http://lemonweb.cern.ch/lemon-status/las_alarm_detail.php?alarm_id=41038&host=voms103

4) In the "History of alarm value" you find why the alarm was triggered:
   voms103:5220:1[0] > 0 || voms103:5221:1[0] > 0 || voms103:5222:1[1] > 0 || voms103:5223:1[0] > 0 || voms103:5224:1[0] > 0

5) Now you want to know what metrics 5220 to 5224 are. At least one of them has a value > 0, which is why the alarm was raised.
   Go to "Metrics" (at the top of the page) =
   http://lemonweb.cern.ch/lemon-status/metric_descriptions.php

6) Look for 5220 etc. You find:

   voms-admin_alice_inconsistent_database <metric_info.php?metric=voms-admin_alice_inconsistent_database>  5220  log.Parse <metric_class_info.php?class=log.Parse>  Y  Count of inconsistency messages in alice voms-admin log
   voms-admin_atlas_inconsistent_database <metric_info.php?metric=voms-admin_atlas_inconsistent_database>  5221  log.Parse <metric_class_info.php?class=log.Parse>  Y  Count of inconsistency messages in atlas voms-admin log
   voms-admin_cms_inconsistent_database <metric_info.php?metric=voms-admin_cms_inconsistent_database>  5222  log.Parse <metric_class_info.php?class=log.Parse>  Y  Count of inconsistency messages in cms voms-admin log
   voms-admin_dteam_inconsistent_database <metric_info.php?metric=voms-admin_dteam_inconsistent_database>  5223  log.Parse <metric_class_info.php?class=log.Parse>  Y  Count of inconsistency messages in dteam voms-admin log
   voms-admin_lhcb_inconsistent_database <metric_info.php?metric=voms-admin_lhcb_inconsistent_database>  5224  log.Parse <metric_class_info.php?class=log.Parse>  Y  Count of inconsistency messages in lhcb voms-admin log

7) From 4 and 6, you can conclude that the alarm was triggered because there was an error message in the CMS log file.

8) On voms103, you can run "ncm-query --dump /system/monitoring | less", and look for 5222 (which had value 1 instead of 0).
   You find:

   +-_5222
     $ active : (boolean) 'true'
     $ class : (string) 'log.Parse'
     $ descr : (string) 'Count of inconsistency messages in cms voms-admin log'
     $ latestonly : (boolean) 'true'
     $ name : (string) 'voms-admin_cms_inconsistent_database'
     +-param
       $ 0 : (string) 'logfile'
       $ 1 : (string) '/var/log/tomcat5/voms-admin.cms.log'
       $ 2 : (string) 'istring'
       $ 3 : (string) 'Internal database inconsistency'
       $ 4 : (string) 'estring'
       $ 5 : (string) '^[^\d]'
       $ 6 : (string) 'dformat'
       $ 7 : (string) '%F %T'
       $ 8 : (string) 'sincelast'
       $ 9 : (string) '15m'
     $ period : (long) '600'

---------> You know that the error message is "Internal database inconsistency" found in '/var/log/tomcat5/voms-admin.cms.log'.

   at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
   at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
</verbatim>
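Steps 4 to 7 above can be roughly automated. The following is a minimal sketch, assuming that each term of the "History of alarm value" line reads =host:metric_id:field[sampled_value] > threshold=. That reading is inferred from the example on this page (5222 is the only term with a bracketed 1), not taken from the Lemon documentation. The function name =firing_metrics= and the lookup table are hypothetical, and only ">" comparisons are handled.

```python
import re
from typing import List

# One alarm term, e.g. "voms103:5222:1[1] > 0".
# Assumed layout: host:metric_id:field[sampled_value] > threshold
TERM_RE = re.compile(
    r"(?P<host>[\w.-]+):(?P<metric>\d+):(?P<field>\d+)"
    r"\[(?P<value>-?\d+(?:\.\d+)?)\]\s*>\s*(?P<threshold>-?\d+(?:\.\d+)?)"
)

# Metric names copied from the metric descriptions listed in step 6.
METRIC_NAMES = {
    5220: "voms-admin_alice_inconsistent_database",
    5221: "voms-admin_atlas_inconsistent_database",
    5222: "voms-admin_cms_inconsistent_database",
    5223: "voms-admin_dteam_inconsistent_database",
    5224: "voms-admin_lhcb_inconsistent_database",
}


def firing_metrics(expression: str) -> List[int]:
    """Return the metric ids whose sampled value exceeds the threshold."""
    firing = []
    for term in TERM_RE.finditer(expression):
        if float(term["value"]) > float(term["threshold"]):
            firing.append(int(term["metric"]))
    return firing


expr = (
    "voms103:5220:1[0] > 0 || voms103:5221:1[0] > 0 || "
    "voms103:5222:1[1] > 0 || voms103:5223:1[0] > 0 || "
    "voms103:5224:1[0] > 0"
)

for metric_id in firing_metrics(expr):
    print(metric_id, METRIC_NAMES.get(metric_id, "unknown metric"))
```

For the alarm line from step 4 this reports metric 5222, the cms voms-admin log, which matches the conclusion reached by hand in step 7.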
Topic revision: r22 - 2008-03-27 - SteveTraylen
Copyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.