LHCb Operational Procedures and Maintenance
Summary:
We use Lemon to monitor critical services and machine states that have proved to be points of failure in the past. There is only one metric for all DIRAC services and agents. A detailed description of the metric can be found in its own section.
The list of metrics and corresponding exceptions follows.
List of metrics and exceptions:
Metrics table
This table is also part of the [[ExperimentLemonMetrics][list]] of experiment-specific metrics for all experiments.
Exceptions table
Metric descriptions
Some of the metrics used deserve further explanation to make clear what happens behind the scenes.
DIRAC's Lemon Agent
DIRAC's Lemon Agent is the part of the DIRAC Framework system that was written to enable Lemon monitoring of all DIRAC services and agents. In this section,
"Lemon Agent" means this part of the DIRAC framework, not a part of the Lemon monitoring system itself. The agent is written in a generic way: exactly one Lemon Agent runs on each host, and it is the same on all machines. The agent monitors all services/agents that are installed and set up on the machine. It gets the list from the local configuration (by walking through DIRAC's directory structure). According to the criticality of each agent/service (see
below), it writes the status of the service/agent to a local log file (
/opt/dirac/runit/Framework/LemonAgent/log/current
), which is automatically rotated by DIRAC. An independent TODO NAME Lemon metric then regularly parses the log file and looks for failing critical services/agents.
This approach has many benefits, namely:
- only one generic agent for every host
- no need to define a list of agents/services for each host
- it gets installed automatically with DIRAC
- the log file is rotated for free
- no need to hassle with certificates
Criticality definition:
Every agent/service has a defined criticality. It is defined per system (Production/Development) in the Configuration System component. The values can be overridden in the local configuration file of the Lemon Agent (
/opt/dirac/etc/FrameworkSystem_LemonAgent.cfg
). There are currently only two levels of criticality:
Critical and
NonCritical. The Lemon metric only checks for failing
Critical services/agents.
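The override rule described above can be sketched in a few lines of Python. This is only an illustration: the function names and the component names are hypothetical, and treating a component with no defined criticality as NonCritical is an assumption of the sketch, not documented agent behaviour.

```python
# Illustrative sketch only. All names below are hypothetical; the real DIRAC
# option names and component names are not given on this page.
CRITICAL = "Critical"
NON_CRITICAL = "NonCritical"


def effective_criticality(central, local_overrides):
    """Merge centrally defined levels with local overrides (local wins)."""
    merged = dict(central)
    merged.update(local_overrides)
    return merged


def failing_critical(statuses, criticality):
    """Return the sorted names of failing components whose level is Critical.

    `statuses` maps a component name to True (running) or False (failing).
    Components without a defined level are treated as NonCritical here,
    which is an assumption of this sketch.
    """
    return sorted(
        name
        for name, running in statuses.items()
        if not running and criticality.get(name, NON_CRITICAL) == CRITICAL
    )


# Hypothetical example data.
central = {"Example/ServiceA": CRITICAL, "Example/AgentB": NON_CRITICAL}
local = {"Example/AgentB": CRITICAL}  # local file raises AgentB to Critical
statuses = {"Example/ServiceA": True, "Example/AgentB": False}

levels = effective_criticality(central, local)
print(failing_critical(statuses, levels))
```

With the central levels alone, the failing AgentB would be ignored; with the local override applied, it is reported.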
Procedures:
Because there is only one metric per machine, on which several services/agents may be running, one has to log in to the machine to see which service/agent is actually failing. This also applies to the operators in the CCC, who are instructed to do so before contacting the responsible persons. When the exception is first raised, an email containing this information is automatically sent from the machine before any alarm appears (the alarm is only raised after a given number of consecutive failures).
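The "given number of consecutive failures" rule can be pictured with the following Python sketch. It is not Lemon's implementation; the function name and the threshold value of 3 are assumptions chosen for illustration.

```python
def first_alarm_index(samples, threshold):
    """Return the index of the sample at which an alarm would fire, or None.

    `samples` is a sequence of booleans, True meaning that the check failed.
    The alarm fires as soon as `threshold` failures have been seen in a row;
    any successful sample resets the count.
    """
    streak = 0
    for index, failed in enumerate(samples):
        streak = streak + 1 if failed else 0
        if streak >= threshold:
            return index
    return None


# Hypothetical history: one isolated failure, then three failures in a row.
history = [True, False, True, True, True]
print(first_alarm_index(history, threshold=3))
```

The isolated failure at the start would already trigger the email described above, but the alarm only appears at the last sample.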
Requirements:
Our own version of the lemon-parse-log sensor, which can handle UTC strings in the log:
/afs/cern.ch/user/j/jhorky/public/sl5/lemon-sensor-parse-log-1.1.2r1-1.x86_64.rpm
Samclient monitoring
Monitoring of the SAM test submission system is done in the same way as on the samXXX machines managed by the SAM team. Because SAM is going to be retired, the team does not accept any new VO on their machines, so we deployed the same sensor on our own machine.
SAM test submission is monitored using the samclient sensor, which must be installed by hand. There is only a version for 32-bit SL4, but the sensor is written in Perl, so the architecture it is deployed on should not matter (in fact, it is currently running on the 64-bit volhcb05 without problems). The lack of an RPM in the SL5/64-bit repository is the reason why it cannot be installed using a Quattor directive.
The samclient.lockfile metric class that we use monitors the age of all lockfiles inside .same directories under a given parent directory (/home/santinel for volhcb05). It outputs the age, and the corresponding exception then checks whether it is older than the set threshold. See the templates linked from the tables above.
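The idea behind the lockfile check can be sketched in Python as follows. This is not the samclient sensor (which is written in Perl); it assumes that every file inside a .same directory counts as a lockfile and that age is measured from the modification time. The function names, the demo directory layout and the one-hour threshold are hypothetical.

```python
import os
import tempfile
import time


def lockfile_ages(parent_dir, now=None):
    """Return {path: age in seconds} for every file in a `.same` directory."""
    now = time.time() if now is None else now
    ages = {}
    for dirpath, _dirnames, filenames in os.walk(parent_dir):
        if os.path.basename(dirpath) != ".same":
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            ages[path] = now - os.path.getmtime(path)
    return ages


def stale_lockfiles(parent_dir, max_age_seconds, now=None):
    """Return the sorted paths of lockfiles older than the threshold."""
    ages = lockfile_ages(parent_dir, now)
    return sorted(path for path, age in ages.items() if age > max_age_seconds)


# Demo on a throwaway directory tree (layout and names are made up).
with tempfile.TemporaryDirectory() as parent:
    same_dir = os.path.join(parent, "someuser", ".same")
    os.makedirs(same_dir)
    lock = os.path.join(same_dir, "lock")
    open(lock, "w").close()
    two_hours_ago = time.time() - 7200
    os.utime(lock, (two_hours_ago, two_hours_ago))  # pretend the lock is old
    stale = stale_lockfiles(parent, max_age_seconds=3600)
    demo_names = [os.path.relpath(path, parent) for path in stale]

print(demo_names)
```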
Requirements:
The sensor must be installed by hand. It can be found here:
http://swrep.cern.ch/swrep/i386_slc4/lemon-sensor-samclient-1.0.0-7.noarch.rpm
DIRAC certificate validity
For some reason, more than one instance of FIO::CertOK on one host causes problems: if information is fetched from both instances simultaneously (at exactly the same time), one of the instances does not get updated. As a workaround, we set the periods of the two sensors to two distinct prime numbers of seconds, each around 5 hours, which means that a collision only occurs about once every 10 years.
Strange, but it works.
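The "once per 10 years" figure can be checked with a little arithmetic: two metrics that start together are sampled simultaneously again after the least common multiple of their periods, which for two distinct primes is simply their product. In the Python sketch below, 17761 is the period listed in the requirements for this check; 18013 is a hypothetical second period chosen only because it is also a prime close to 5 hours, since the actual second value is not given on this page.

```python
from math import gcd

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def collision_interval(period_a, period_b):
    """Seconds between simultaneous samplings of two metrics started together.

    This is the least common multiple of the two periods.
    """
    return period_a * period_b // gcd(period_a, period_b)


period_a = 17761  # period from the requirements of this check (seconds)
period_b = 18013  # hypothetical second period; the real value is not given

interval = collision_interval(period_a, period_b)
print(interval, "seconds, about", round(interval / SECONDS_PER_YEAR, 1), "years")

# For comparison: two "round" periods that share a large common divisor.
print(collision_interval(17760, 18000), "seconds")
```

With round values such as 17760 and 18000 seconds the interval drops to 1332000 seconds, roughly 15 days, which is why coprime periods are used.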
There is, however, still one unresolved thing. Running
lemon-host-check
tries to get fresh information from all metrics to see whether there is an exception or not. This causes the "collision" of the metrics and thus results in output like this:
root@volhcb20 ~/ >/usr/sbin/lemon-host-check
[INFO] lemon-host-check version 1.3.3 started by root at Tue May 18 15:46:36 2010 on volhcb20.cern.ch
[VERB] 30660
[VERB] Name: exception.VOBOX_LHCb_DIRAC_Cert_valid_err (dirac-host-certificate-expiring)
[VERB] Reason: (null)
[VERB] Notes: possible false exception (cacheAll: enabled)
[VERB] Exceptions: 1 - Running actuators: 0 - Disabled exceptions: 0 - State: Production
The
possible false exception
here means that lemon-host-check was not able to retrieve the status of the corresponding metric. However, when the
server asks for data, it uses the same source of information as the
lemon-cli
application and thus
does not suffer from this error.
Requirements:
"/system/monitoring/metric/_810/period" = 17761;
DIRAC certificate mode and owner check
There is a bug in the mainstream version of lemon-sensor-file: it returns nonsensical values for the mode and also for the group ID of a given file. Our own version must be installed. See the requirements.
Requirements:
One must install our own version of lemon-sensor-parse-log:
/afs/cern.ch/user/j/jhorky/public/sl5/lemon-sensor-parse-log-1.1.2r1-1.x86_64.rpm
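What a mode and owner check verifies can be illustrated with a short Python sketch based on os.stat. It does not reproduce the Lemon file sensor; the function name and the expected values used in the demo are assumptions, and the demo assumes a POSIX system.

```python
import os
import stat
import tempfile


def check_mode_and_owner(path, expected_mode, expected_uid, expected_gid):
    """Return a list of problems; an empty list means the file is as expected."""
    info = os.stat(path)
    problems = []
    actual_mode = stat.S_IMODE(info.st_mode)  # permission bits only
    if actual_mode != expected_mode:
        problems.append("mode is %o, expected %o" % (actual_mode, expected_mode))
    if info.st_uid != expected_uid:
        problems.append("uid is %d, expected %d" % (info.st_uid, expected_uid))
    if info.st_gid != expected_gid:
        problems.append("gid is %d, expected %d" % (info.st_gid, expected_gid))
    return problems


# Demo on a temporary file standing in for a certificate file.
with tempfile.NamedTemporaryFile() as handle:
    os.chmod(handle.name, 0o600)
    owner = os.stat(handle.name)
    ok = check_mode_and_owner(handle.name, 0o600, owner.st_uid, owner.st_gid)
    bad = check_mode_and_owner(handle.name, 0o400, owner.st_uid, owner.st_gid)

print(ok)
print(bad)
```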
Procedures:
Procedures for operators in CCC are maintained in the
Operational Procedures Management Portal
. Existing procedures for the exceptions are listed in the table above, but they can always be found by searching the portal for the exception name (e.g. try searching for gridftp_wrong; you will get three pages, but only one of them is dedicated to VOBOXes). Anybody can propose a change to an existing procedure, but it is the validator of the page who decides whether the changes take effect.
It is always possible to create a new procedure page. In that case, it is checked by a chosen validator and also by somebody from the CC operations team.
Maintaining existing metrics:
It is highly recommended to consult the
dedicated tutorial page. There is also
general official documentation
for Lemon system and for some
sensors
too. Please note that the official documentation may not be fully up to date.
Here we list specific information concerning our own metrics.
Migrating to a different host
First of all, check whether the metric is already enabled on the host, using the [[Convenient_commands][right command]]. If it is not, find the template from which the given monitoring is included. Right now, this is done in the root template of the machine (e.g.
http://tpl-viewer.cern.ch/cdb-tpl-view/tpl_view.php?profile=profiles/profile_volhcb15
), but this may change (common DIRAC monitoring will probably be moved to one of the DIRAC templates). Be sure to read all comments concerning the metric; some metrics require special handling. Then, using the
cdbop
command (see
tutorial), edit the target machine's template and include the required lines. Also comment out or delete the corresponding lines in the previous machine's template. Update and commit.
Do not forget to update the list of metrics/exceptions on this page and also on the [[ExperimentLemonMetrics][page for all experiments]].
You should also connect to the machine and make sure that the changes you have made are properly propagated (see [[VOSpecificServicesMon_Tutoria][this]] and
this).
New metrics and exceptions
Right now, we have a dedicated range of metric IDs (4060-4090) and exception IDs (30660-30690). Half of them are already used, mainly by the inactive
VOBOX_LHCb_DIRAC_[service]_log_[time,content]_check
metrics, which we decided to abandon. These could be reused in the future after contacting Lemon support (
lemon.support@cernNOSPAMPLEASE.ch) and deleting the templates (
/prod/customization/lhcb/vobox/pro_monitoring_metrics_dirac_[service]
).
Every time a new metric/exception is created, one has to contact Lemon support so that they can update the metric list (otherwise, it will not be listed on the web).
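When picking an ID for a new metric or exception, a small helper can list what is still free in the dedicated ranges. The Python sketch below assumes that both ranges are inclusive; the set of used IDs in the example is made up and has to be taken from the current metric and exception tables in practice.

```python
# Dedicated ranges from this page, assumed to be inclusive at both ends.
METRIC_ID_RANGE = range(4060, 4091)
EXCEPTION_ID_RANGE = range(30660, 30691)


def free_ids(id_range, used):
    """Return the IDs in `id_range` that do not appear in `used`, in order."""
    used = set(used)
    return [candidate for candidate in id_range if candidate not in used]


# Hypothetical list of metric IDs already taken.
used_metric_ids = [4060, 4061, 4063]
print(free_ids(METRIC_ID_RANGE, used_metric_ids)[:3])
```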
Removing
Removing a metric is very similar to
migrating. Just do not add the lines to another machine's profile. Once again, do not forget to update the lists (see above).
Disabling
Disabling can be done in two different ways. The first is to comment out the corresponding include directives in the given machine's profile template (if the metric is included from within the profile template and you want to disable it only for that host). If it is included from elsewhere (e.g. from a common DIRAC template used by every VOBOX), you should NOT disable it in the template itself (unless you want to disable it globally). The correct way is to add lines similar to the following to the machine's profile file:
"/system/monitoring/metric/_27/active" = false;
"/system/monitoring/exception/_30054/active" = false;
Make sure you specify these options AFTER the metric is included (otherwise these options will get overwritten).
Once again, do not forget to update the lists (see above).
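If several metrics and exceptions have to be disabled on one host, the per-host override lines can be generated rather than typed. The following Python sketch only reproduces the two line formats shown above; the function name is hypothetical, and the output still has to be placed in the profile after the include of the metric.

```python
def disable_lines(metric_ids=(), exception_ids=()):
    """Build Pan assignments that deactivate the given metrics and exceptions."""
    lines = [
        '"/system/monitoring/metric/_%d/active" = false;' % metric_id
        for metric_id in metric_ids
    ]
    lines += [
        '"/system/monitoring/exception/_%d/active" = false;' % exception_id
        for exception_id in exception_ids
    ]
    return lines


# The IDs used in the example on this page.
for line in disable_lines(metric_ids=[27], exception_ids=[30054]):
    print(line)
```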
LHCb specific notes:
Some of the metrics require installing our own versions of the sensors. These versions fix some bugs and also provide some additional functionality, listed below. When installing RPMs by hand, it is also necessary to instruct Quattor not to replace them, using the following directives in the templates:
"/software/components/spma/userpkgs" = "yes";
"/software/components/spma/userprio" = "yes";
File sensor:
Log.Parse sensor
The RPMs are available at
/afs/cern.ch/user/j/jhorky/public/sl{4,5}/
Convenient commands:
Checking settings of all metrics and exceptions on the host:
ncm-query --pan --dump /system/monitoring
--
JiriHorky - 17-May-2010