LHCbSAM for ETF

This page describes how to maintain the LHCbSAM probes used with Nagios within ETF. The results of these probes are used to generate the monthly WLCG site Availability & Reliability reports, and are visible on the SAM3 Dashboard and in a more raw format on the etf-lhcb-prod Check_MK dashboard. There is also an etf-lhcb-preprod version of the Check_MK dashboard for testing.

References to "ask the ETF team" currently mean "ask Marian Babik".

LHCb ETF Machines

LHCb has access to two machines running ETF, configured for LHCb: etf-lhcb-preprod.cern.ch and etf-lhcb-prod.cern.ch. The first machine is used for testing with the "QA" (Quality Assurance) version of the service. Ask the ETF team for root access to these machines.

LHCbSAM in gitlab.cern.ch

LHCb maintains the LHCbSAM repository in GitLab. It is used to build the RPM nagios-plugins-wlcg-org.lhcb, which is installed on the ETF machines (on preprod first, for testing). The repository has a rather deep directory structure:

LHCbSAM/Makefile
LHCbSAM/Makefile.koji
LHCbSAM/README.md
LHCbSAM/VERSION
LHCbSAM/nagios-plugins-wlcg-org.lhcb.spec
LHCbSAM/usr/lib/ncgx/x_plugins/lhcb_vofeed.py
LHCbSAM/usr/lib/ncgx/x_plugins/lhcb_webdav.py
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/check_pilot_results
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/srmvometrics.py
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb.gridJob.jdl.template
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/etc/wn.d/org.lhcb/commands.cfg
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/etc/wn.d/org.lhcb/services.cfg
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/SRM-lhcb-FileAccess
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-cvmfs
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-lhcb-FileAccess
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-mjf
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-brokerinfo
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-csh
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-lcg-rm-gfal
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-vo-swdir
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-sft-voms
LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/WN-vo-id-card

The lhcb_vofeed.py script takes the XML LHCb VO Feed and generates pieces of configuration required by Nagios from it. The VO Feed format is specified by WLCG and documented here.
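As a rough illustration of the kind of transformation involved, the sketch below parses a small VO Feed-like XML document and groups service hostnames by flavour. The sample XML, the element and attribute names (atp_site, service, hostname, flavour) and the function name hosts_by_flavour are assumptions made for illustration only. The WLCG specification and lhcb_vofeed.py itself remain the reference for the real schema and behaviour.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: element and attribute names are assumed, and the
# sites and hosts are invented. This is not the real LHCb VO Feed.
SAMPLE_FEED = """\
<root>
  <atp_site name="SITE-A">
    <service hostname="ce01.example.org" flavour="HTCONDOR-CE"/>
    <service hostname="se01.example.org" flavour="SRM"/>
  </atp_site>
  <atp_site name="SITE-B">
    <service hostname="ce02.example.org" flavour="HTCONDOR-CE"/>
  </atp_site>
</root>
"""


def hosts_by_flavour(feed_xml):
    """Return {flavour: sorted list of (site, hostname)} from a feed string."""
    root = ET.fromstring(feed_xml)
    grouped = defaultdict(list)
    for site in root.iter("atp_site"):
        for service in site.iter("service"):
            grouped[service.get("flavour")].append(
                (site.get("name"), service.get("hostname")))
    return {flavour: sorted(hosts) for flavour, hosts in grouped.items()}


if __name__ == "__main__":
    for flavour, hosts in sorted(hosts_by_flavour(SAMPLE_FEED).items()):
        print(flavour, hosts)
```

A grouping like this is the sort of intermediate structure from which per-service Nagios host and check definitions could then be written out.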

There are three classes of probe which are run by Nagios from the ETF LHCb machine:

  • Probes in LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/ which are executed on the ETF LHCb machine itself
  • Probes in LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/ which are run on Worker Nodes inside grid jobs submitted by ETF
  • Probes in LHCbSAM/usr/libexec/grid-monitoring/probes/org.lhcb/wnjob/org.lhcb/probes/org.lhcb/ which are run by the DIRAC Pilot 3.0 in VMs (currently) and on grid Worker Nodes (not yet)

The current version is given by the VERSION file and embedded in the nagios-plugins-wlcg-org.lhcb.spec file which is used during the remote build by Koji. It is currently essential that both are kept in sync!
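Since the two version strings have to be kept in sync by hand, a small check before building can catch a mismatch. The following is a minimal sketch, assuming that VERSION contains only the version string and that the spec file carries a literal "Version:" tag (not a macro). The function names are hypothetical and not part of LHCbSAM.

```python
import re
from pathlib import Path


def spec_version(spec_text):
    """Return the value of the first 'Version:' tag in a spec file, or None."""
    match = re.search(r"^Version:\s*(\S+)", spec_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def versions_in_sync(version_text, spec_text):
    """True if the VERSION file content equals the spec's Version tag."""
    return version_text.strip() == spec_version(spec_text)


def check_repo(repo_dir):
    """Compare VERSION with the spec file inside a checked-out repository."""
    repo = Path(repo_dir)
    version_text = (repo / "VERSION").read_text()
    spec_text = (repo / "nagios-plugins-wlcg-org.lhcb.spec").read_text()
    return versions_in_sync(version_text, spec_text)
```

Calling check_repo("LHCbSAM") from the directory containing the clone would return False if the two files disagree, which is a cue to fix them before running make rpm.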

Building the RPM with Koji

It would be possible to build the nagios-plugins-wlcg-org.lhcb RPM manually and install it on the ETF LHCb machines. However, these machines may be regenerated from scratch with little or no notice, so the RPMs must be built by Koji and installed by a YUM update.

There is an introduction to Koji. Before doing anything, follow its instructions on how to make a Service Now request to be able to update the tags etf6-qa and etf6-stable. lxplus is a convenient place to run Koji commands, although it should also be possible on a local machine or laptop.

This configuration in $HOME/.koji/config is sufficient:

[koji]
server = http://koji.cern.ch/kojihub
weburl = http://koji.cern.ch/koji
topurl = http://koji.cern.ch/kojifiles
cert   = /not/existing/file

One workflow is to update the LHCbSAM GitLab repo from the usual place you work on Git repos, and then have a temporary directory to make the SRPM:

rm -Rf /tmp/$USER/koji
mkdir -p /tmp/$USER/koji
cd /tmp/$USER/koji
git clone https://gitlab.cern.ch/lhcb-dirac/LHCbSAM.git
cd LHCbSAM/
make rpm
koji build etf6 RPMTMP/SRPMS/nagios-plugins-wlcg-org.lhcb-0.3.14-1.el6.src.rpm
koji tag-pkg etf6-qa nagios-plugins-wlcg-org.lhcb-0.3.14-1.el6

The version in these commands (0.3.14 here) must match the current version given in the VERSION file. The commands build the RPM in Koji and then tag it as etf6-qa, ready for installation on the etf-lhcb-preprod machine. The koji build command shows the progress of the build, but you can also see the results on the Koji website.
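To avoid editing the version by hand in both Koji commands, the file and build names can be derived from the version string. This is a minimal sketch, assuming the release "1" and dist tag "el6" seen in the commands above stay fixed. The helper names build_nvr and koji_commands are hypothetical.

```python
def build_nvr(version, release="1", dist="el6",
              name="nagios-plugins-wlcg-org.lhcb"):
    """Return the name-version-release string passed to 'koji tag-pkg'."""
    return f"{name}-{version.strip()}-{release}.{dist}"


def koji_commands(version):
    """Return the build and tag command lines for a given version string."""
    nvr = build_nvr(version)
    return [
        f"koji build etf6 RPMTMP/SRPMS/{nvr}.src.rpm",
        f"koji tag-pkg etf6-qa {nvr}",
    ]


if __name__ == "__main__":
    # "0.3.14" is the example version from the commands above; in practice
    # the string would be read from the VERSION file.
    for line in koji_commands("0.3.14"):
        print(line)
```

The sketch only prints the command lines, so they can be reviewed before being run.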

The RPM can only be installed once it has been copied to the linuxsoft.cern.ch repository. There are separate trees for qa/preprod and production.

Once the RPM is visible in the repo, the YUM update can be forced with:

yum clean all
yum -y update nagios-plugins-wlcg-org.lhcb

ncgx periodically rebuilds the Nagios configuration in a cron job defined by /opt/omd/sites/etf/etc/cron.d/ncgx. You can force this to happen immediately by running these commands as the etf user:

su - etf
ncgx
cmk -O

If this succeeds, the generated files in /etc/ncgx/conf.d will have recent time stamps.
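The time stamp check can be scripted instead of inspected by eye. The sketch below lists files in a directory whose modification time is older than a threshold. The 15 minute default and the function name stale_files are assumptions, not values taken from the ETF configuration.

```python
import os
import time


def stale_files(directory, max_age_seconds=900, now=None):
    """Return sorted names of regular files older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for entry in os.scandir(directory):
        # Only regular files are considered; subdirectories are skipped.
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            stale.append(entry.name)
    return sorted(stale)
```

Under these assumptions, stale_files("/etc/ncgx/conf.d") returning an empty list shortly after running ncgx would suggest that every file was regenerated.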

Robot certificate

The ETF machines use a GSI proxy for running probes directly and for submitting grid jobs which run probes. These proxies are obtained from the CERN myproxy service, based on proxies we deposit there. LHCb uses a robot certificate owned by the lbdirac account for this. The proxy can be deposited with commands like this:

export GT_PROXY_MODE=rfc
myproxy-init -c 9000 -t 24 -k NagiosRetrieve-ETF-lhcb  -s myproxy.cern.ch -l nagios -x -Z "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samtf/CN=555091/CN=Robot: SAM Test Framework" -C $CERT_FILE -y $KEY_FILE

Repainting

Problems with site availability monitoring can be corrected after the fact following the instructions here: https://twiki.cern.ch/twiki/bin/view/ArdaGrid/ProfileCorrections

For example, with a copy of the private key without a password (!) in /tmp/userkeynopw.pem:

echo -e '2015-05-05 00:00:00\t2015-05-11 12:00:00\t720\tLCG.RRCKI.ru\tOK\tgreen\tNone\tnvalue=0' > /tmp/my_data
curl -k -X POST -T /tmp/my_data --cert ~/.globus/usercert.pem --key /tmp/userkeynopw.pem 'https://wlcg-mon.cern.ch/dashboard/request.py/postMetricValues' 
rm /tmp/userkeynopw.pem
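Because a malformed record could repaint the wrong interval, it may be worth sanity-checking the data file before posting it. The sketch below validates one tab-separated record. The expected field count and the timestamp format are inferred from the single example above, not from the ProfileCorrections instructions, and the function name check_record is hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
EXPECTED_FIELDS = 8  # inferred from the example record, not from a specification


def check_record(line):
    """Return a list of problems found in one record (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        return [f"expected {EXPECTED_FIELDS} fields, found {len(fields)}"]
    try:
        start = datetime.strptime(fields[0], TIME_FORMAT)
        end = datetime.strptime(fields[1], TIME_FORMAT)
    except ValueError as exc:
        return [f"bad timestamp: {exc}"]
    if start >= end:
        return ["start time is not before end time"]
    return []
```

Running check_record over each line of /tmp/my_data and posting only when every result is empty would catch simple mistakes such as a missing tab or swapped start and end times.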

-- AndrewMcNab - 2017-02-13

Topic revision: r8 - 2019-07-18 - AndrewMcNab
 