This page describes how to maintain the LHCbSAM probes used with Nagios within ETF. The results of these probes are used to generate the monthly WLCG site Availability & Reliability reports, and are visible on the SAM3 Dashboard and, in a rawer format, on the etf-lhcb-prod Check_MK dashboard. There is also an etf-lhcb-preprod version of the Check_MK dashboard for testing.
References to "ask the ETF team" currently mean "ask Marian Babik".
LHCb ETF Machines
LHCb has access to two machines running ETF, configured for LHCb: etf-lhcb-preprod.cern.ch and etf-lhcb-prod.cern.ch. The first machine is used for testing with the "QA" (Quality Assurance) version of the service. Ask the ETF team for root access to these machines.
LHCbSAM in gitlab.cern.ch
LHCb maintains the LHCbSAM repository in GitLab. It is used to build the RPM nagios-plugins-wlcg-org.lhcb, which is installed on the ETF machines (on preprod first, for testing). This repository has a rather deep directory structure:
LHCbSAM/Makefile
LHCbSAM/Makefile.koji
LHCbSAM/README.md
LHCbSAM/VERSION
LHCbSAM/nagios-plugins-wlcg-org.lhcb.spec
LHCbSAM/usr/lib/ncgx/x_plugins/lhcb_vofeed.py
LHCbSAM/usr/lib/ncgx/x_plugins/lhcb_webdav.py
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/check_pilot_results
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/srmvometrics.py
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb.gridJob.jdl.template
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/etc/wn.d/org.lhcb/commands.cfg
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/etc/wn.d/org.lhcb/services.cfg
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/SRM-lhcb-FileAccess
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-cvmfs
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-lhcb-FileAccess
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-mjf
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-brokerinfo
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-csh
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-lcg-rm-gfal
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-vo-swdir
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-voms
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-vo-id-card
The lhcb_vofeed.py script takes the XML LHCb VO Feed and generates from it the pieces of configuration required by Nagios. The VO Feed format is specified and documented by WLCG.
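As a rough illustration of the kind of processing involved, the sketch below reads a VO-Feed-like XML document and groups service hostnames by site. It is not the code of lhcb_vofeed.py. The element and attribute names (atp_site, service, hostname, flavour), the sample data and the function name services_by_site are assumptions made for this example and should be checked against the WLCG specification and the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample; the real feed carries more elements.
SAMPLE_FEED = """\
<root>
  <vo>lhcb</vo>
  <atp_site name="EXAMPLE-SITE-A">
    <service hostname="ce01.example.org" flavour="HTCONDOR-CE"/>
    <service hostname="se01.example.org" flavour="SRM"/>
  </atp_site>
  <atp_site name="EXAMPLE-SITE-B">
    <service hostname="ce02.example.org" flavour="ARC-CE"/>
  </atp_site>
</root>
"""


def services_by_site(feed_xml):
    """Return {site name: [(hostname, flavour), ...]} from VO-feed-like XML."""
    root = ET.fromstring(feed_xml)
    result = {}
    for site in root.iter("atp_site"):
        # Collect every <service> element nested under this site.
        result[site.get("name")] = [
            (svc.get("hostname"), svc.get("flavour"))
            for svc in site.iter("service")
        ]
    return result


if __name__ == "__main__":
    for site, services in services_by_site(SAMPLE_FEED).items():
        for hostname, flavour in services:
            print(site, hostname, flavour)
```

A per-site mapping like this is the sort of intermediate structure from which host and service definitions could then be written out.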
There are three classes of probe which are run by Nagios from the ETF LHCb machine:
- Probes in LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/ which are executed on the ETF LHCb machine
- Probes in LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/ run on Worker Nodes inside grid jobs submitted by ETF
- Probes in LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/ run by the DIRAC Pilot 3.0 in VMs (currently) and on grid Worker Nodes (not yet).
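Whichever class a probe belongs to, Nagios interprets its result through the standard plugin convention: one line of status text on stdout and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below shows that convention with a hypothetical directory check. The function names and the check itself are invented for illustration and are not taken from any of the LHCbSAM probes.

```python
import os
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_directory(path):
    """Return (exit_code, message) saying whether path is a listable, non-empty directory."""
    if not os.path.isdir(path):
        return CRITICAL, f"{path} is not a directory"
    try:
        entries = os.listdir(path)
    except OSError as exc:
        return CRITICAL, f"cannot list {path}: {exc}"
    if not entries:
        return WARNING, f"{path} is empty"
    return OK, f"{path} has {len(entries)} entries"


def main(argv):
    """Print one status line and return the matching Nagios exit code."""
    if len(argv) != 2:
        print(f"UNKNOWN: usage: {argv[0]} DIRECTORY")
        return UNKNOWN
    code, message = check_directory(argv[1])
    print(f"{STATUS_NAMES[code]}: {message}")
    return code


if __name__ == "__main__":
    # Demonstration only: a deployed probe would end with sys.exit(main(sys.argv))
    # so that the exit code reaches Nagios.
    exit_code = main([sys.argv[0], os.getcwd()])
    print("exit code would be", exit_code)
```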
The current version is given by the VERSION file and embedded in the nagios-plugins-wlcg-org.lhcb.spec file which is used during the remote build by Koji. It is currently essential that both are kept in sync!
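Because the two version strings are maintained by hand, a small check run before building can catch a mismatch. The following is a minimal sketch which assumes the spec file carries the version in a conventional "Version:" tag; if the real spec defines it differently (for example through a macro), the regular expression would need adjusting. The function names are hypothetical.

```python
import re


def spec_version(spec_text):
    """Return the value of the first 'Version:' tag in RPM spec text, or None."""
    match = re.search(r"^Version:\s*(\S+)", spec_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def versions_in_sync(version_text, spec_text):
    """True if the VERSION file content matches the spec's Version tag."""
    return version_text.strip() == spec_version(spec_text)


if __name__ == "__main__":
    # Sample data; in the repository, read VERSION and
    # nagios-plugins-wlcg-org.lhcb.spec from disk instead.
    sample_spec = "Name: nagios-plugins-wlcg-org.lhcb\nVersion: 0.3.14\nRelease: 1\n"
    print(versions_in_sync("0.3.14\n", sample_spec))
```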
Building the RPM with Koji
It would be possible to build the nagios-plugins-wlcg-org.lhcb RPM manually and install it on the ETF LHCb machines. However, these machines may be regenerated from scratch with little or no notice, so the RPMs must be built by Koji and installed by a YUM update.
There is an introduction to Koji. Before doing anything, follow its instructions on how to make a Service Now request to be able to update the tags etf6-qa and etf6-stable. lxplus is a convenient place to run Koji commands, although it should be possible on a local machine or laptop too.
This configuration in $HOME/.koji/config is sufficient:
[koji]
server = http://koji.cern.ch/kojihub
weburl = http://koji.cern.ch/koji
topurl = http://koji.cern.ch/kojifiles
cert = /not/existing/file
One workflow is to update the LHCbSAM GitLab repo from the usual place you work on Git repos, and then use a temporary directory to make the SRPM:
rm -Rf /tmp/$USER/koji
mkdir -p /tmp/$USER/koji
cd /tmp/$USER/koji
git clone https://gitlab.cern.ch/lhcb-dirac/LHCbSAM.git
cd LHCbSAM/
make rpm
koji build etf6 RPMTMP/SRPMS/nagios-plugins-wlcg-org.lhcb-0.3.14-1.el6.src.rpm
koji tag-pkg etf6-qa nagios-plugins-wlcg-org.lhcb-0.3.14-1.el6
The current version is given in the VERSION file. These commands build the RPM in Koji and then tag it as etf6-qa, ready for installation on the etf-lhcb-preprod machine. The koji build command shows you the progress of the build, but you can also see the results on the Koji website.
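Since the version number appears in both Koji commands, it can be convenient to derive them from the VERSION file content rather than typing them. This sketch simply reproduces the two command lines shown above for a given version. The release "1" and the dist suffix "el6" are taken from that example and are assumptions that may not hold for other builds; koji_commands is a hypothetical helper name.

```python
PACKAGE = "nagios-plugins-wlcg-org.lhcb"


def koji_commands(version, release="1", dist="el6", tag="etf6-qa"):
    """Return the koji build and tag-pkg commands as argument lists."""
    # Name-version-release string as used in the example commands.
    nvr = f"{PACKAGE}-{version.strip()}-{release}.{dist}"
    return [
        ["koji", "build", "etf6", f"RPMTMP/SRPMS/{nvr}.src.rpm"],
        ["koji", "tag-pkg", tag, nvr],
    ]


if __name__ == "__main__":
    # In the repository, pass the content of the VERSION file instead of a literal.
    for command in koji_commands("0.3.14"):
        print(" ".join(command))
```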
The RPM can only be installed once it has been copied to the linuxsoft.cern.ch repository. There are separate trees for qa/preprod and production.
Once the RPM is visible in the repo, the YUM update can be forced with:
yum clean all
yum -y update nagios-plugins-wlcg-org.lhcb
ncgx periodically rebuilds the Nagios configuration in a cron job defined by /opt/omd/sites/etf/etc/cron.d/ncgx. You can force this to happen immediately by running these commands as the etf user:
su - etf
ncgx
cmk -O
If this succeeds, the generated files in /etc/ncgx/conf.d will have recent timestamps.
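The timestamp check can be scripted. The sketch below lists files in a directory whose modification time is older than a chosen threshold. The 600-second default and the function name stale_files are assumptions for illustration; pick a threshold that suits how recently ncgx was run.

```python
import os
import time


def stale_files(directory, max_age_seconds=600, now=None):
    """Return sorted names of regular files in directory older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    directory = "/etc/ncgx/conf.d"
    if os.path.isdir(directory):
        old = stale_files(directory)
        print("Stale files:", ", ".join(old) if old else "none")
    else:
        print(directory, "not found; run this on an ETF machine")
```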
Robot certificate
The ETF machines use a GSI proxy for running probes directly and for submitting grid jobs which run probes. These proxies are obtained from the CERN myproxy service, based on proxies we deposit there. LHCb uses a robot certificate owned by the lbdirac account for this. The proxy can be deposited with commands like this:
export GT_PROXY_MODE=rfc
myproxy-init -c 9000 -t 24 -k NagiosRetrieve-ETF-lhcb -s myproxy.cern.ch -l nagios -x -Z "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samtf/CN=555091/CN=Robot: SAM Test Framework" -C $CERT_FILE -y $KEY_FILE
Repainting
Problems with site availability monitoring can be corrected after the fact following the instructions here:
https://twiki.cern.ch/twiki/bin/view/ArdaGrid/ProfileCorrections
For example, with a passwordless (!) copy of the private key in /tmp/userkeynopw.pem:
echo -e '2015-05-05 00:00:00\t2015-05-11 12:00:00\t720\tLCG.RRCKI.ru\tOK\tgreen\tNone\tnvalue=0' > /tmp/my_data
curl -k -X POST -T /tmp/my_data --cert ~/.globus/usercert.pem --key /tmp/userkeynopw.pem 'https://wlcg-mon.cern.ch/dashboard/request.py/postMetricValues'
rm /tmp/userkeynopw.pem
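The data file posted above is a single tab-separated line that starts with two timestamps. When several corrections are needed, generating the line programmatically avoids quoting mistakes. This is a minimal sketch that only handles the formatting: the meaning and order of the remaining fields are defined by the ProfileCorrections instructions and are passed through unchanged. The function name correction_line is hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def correction_line(start, end, other_fields):
    """Join two timestamps and the remaining fields with tabs, newline-terminated."""
    if end <= start:
        raise ValueError("end must be later than start")
    fields = [start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)]
    fields.extend(str(field) for field in other_fields)
    return "\t".join(fields) + "\n"


if __name__ == "__main__":
    # Reproduces the line written by the echo command in the example above.
    line = correction_line(
        datetime(2015, 5, 5, 0, 0, 0),
        datetime(2015, 5, 11, 12, 0, 0),
        ["720", "LCG.RRCKI.ru", "OK", "green", "None", "nvalue=0"],
    )
    print(repr(line))
```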
-- AndrewMcNab - 2017-02-13