Frontier squids

Frontier contacts

  • CERN: Alessandro.DeSalvo@roma1.infn.it
  • RAL: tim.adye@cern.ch
  • IN2P3-CC: emmanouil.vamvakopoulos@cc.in2p3.fr
  • TRIUMF: frontier@lcg.triumf.ca

Frontier machines

  • CERN machines
  • IN2P3-CC machines
    • ccfrontier01.in2p3.fr
    • ccfrontier02.in2p3.fr
    • ccfrontier03.in2p3.fr
    • ccfrontier04.in2p3.fr (testing)
    • ccfrontier05.in2p3.fr
  • RAL machines (also correctly in AGIS)
    • lcgvo-frontier01.gridpp.rl.ac.uk
    • lcgvo-frontier02.gridpp.rl.ac.uk
    • lcgvo-frontier03.gridpp.rl.ac.uk
  • TRIUMF machines (also correctly in AGIS)
    • frontier-atlas1.lcg.triumf.ca
    • frontier-atlas2.lcg.triumf.ca
    • frontier-atlas3.lcg.triumf.ca

Frontier squids monitoring

Every investigation can start by checking which cron jobs are running:
sudo -u dbfrontier crontab -l

Access

From lxplus only
  • to master machine (to check crons, etc.)
ssh msvatos@frontiermon1.cern.ch
  • to backup machine (to test something in production, etc.)
ssh msvatos@frontiermon2.cern.ch
Monitoring scripts are located at ~dbfrontier/scripts.

Testing

Everything in ~dbfrontier on the master machine is copied to the backup machine every 5 minutes. To stop it (for example when I need to do some testing), I need to:
  • for short periods of testing:
    • log into the backup machine and rename ~dbfrontier/.ssh/authorized_keys. This will cause synchronization errors if left in place too long (hours).
  • for longer testing:
    • edit /home/dbfrontier/scripts/bin/identical_frontiermon and add the folders I do not want overwritten to the EXCLUDE line (a hypothetical sketch follows)
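A minimal sketch of such an edit, assuming identical_frontiermon drives rsync and that the EXCLUDE line collects extra rsync options (the directory names are made up):
# hypothetical: keep these directories out of the master -> backup sync
EXCLUDE="--exclude=scripts/mytest --exclude=local/mytest"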

Code development

There is a git repo.
  • checkout
git clone https://:@gitlab.cern.ch:8443/frontier/frontiermon.git
  • commit changes to existing files (in the frontiermon folder)
    1. git pull
    2. git add file
    3. git commit -m 'reason'
    4. git push
  • publish
~dbfrontier/bin/frontiermon_publish

N.B. Sometimes it is necessary to set execute permissions on the script after it is published.

awstats monitoring

  • related JIRA tickets (assigned to me):
  • config file (in checked out repo files):
    • wlcgsquidmon/conf/awstats/SiteProjectNodesMapping - sites from which awstats are read. Syntax (a hypothetical example line is given after this list):
      • SiteProject - grouping according to the site
      • DNS alias - actual name of the machine
      • awstats name - the name of the machine (chosen by me) in the awstats monitoring
      • role - launchpad/proxy
      • mode - production/testing
  • the webpage uses the perl script from the awstats installation (/home/squidmon/etc/awstats/wwwroot/cgi-bin/awstats.pl), i.e. the pages for individual frontiers and time intervals come from awstats itself; ATLAS creates only the summary page
  • the page is created/updated by hand. It can be updated at wlcgsquidmon/wwwsrc/awstats/atlas.html in the SVN
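A hypothetical example line for SiteProjectNodesMapping, matching the five fields above (the separator and the actual values are assumptions):
# SiteProject   DNS alias                 awstats name         role       mode
CERN            atlasfrontier-ai.cern.ch  atlasfrontier-ai-1   launchpad  production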

Submission

  1. conf/awstats/SiteProjectNodesMapping committed to SVN and published (by running ~squidmon/bin/squidmon_publish on wlcgsquidmon2.cern.ch)
  2. Dave propagated it to the system (made ~squidmon/etc/awstats/wwwroot/cgi-bin/awstats.atlasfrontier-local-1.conf, etc.)
  3. wwwsrc/awstats/atlas.html committed to SVN and published (by running ~squidmon/bin/squidmon_publish on wlcgsquidmon2.cern.ch)
  4. https://its.cern.ch/jira/browse/FTOPSDEVEL-122 updated

maxthreads monitoring

Example data files and their contents:
/home/dbfrontier/data/awstats/triumfatlas/chkthread_triumf-frontier-1/maxthreads.triumf-frontier-1.2017-08-04
# [08/04/17 00:34:51.215 PDT -0700] to [08/04/17 00:39:51.450 PDT -0700]
2017/08/04 00:39:51 ATLAS_frontier maxthreads=7 averagetime=783.631 msec avedbquerytime=2139.39 msec threadsthreshold=450
/home/dbfrontier/data/awstats/ralatlas/chkthread_ral-lcgvo-frontier01/maxthreads.ral-lcgvo-frontier01.2017-08-04 
# [08/04/17 00:00:05.342 BST +0100] to [08/04/17 00:04:51.555 BST +0100]
2017/08/04 00:04:51 frontierATLAS maxthreads=36 averagetime=4038.39 msec avedbquerytime=13888.4 msec threadsthreshold=375
/home/dbfrontier/data/awstats/cernatlas/chkthread_atlasfrontier-ai-1/maxthreads.atlasfrontier-ai-1.2017-08-04
# [08/04/17 00:00:31.087 CEST +0200] to [08/04/17 00:04:51.606 CEST +0200]
2017/08/04 00:04:51 atlr maxthreads=21 averagetime=151.492 msec avedbquerytime=75.9192 msec threadsthreshold=375
2017/08/04 00:04:51 t0atlr maxthreads=0 averagetime=0 msec avedbquerytime=0 msec threadsthreshold=375
/home/dbfrontier/data/awstats/ccin2p3atlas/chkthread_ccin2p3-ccfrontier01/maxthreads.ccin2p3-ccfrontier01.2017-08-04 
# [08/04/17 00:03:17.761 CEST +0200] to [08/04/17 00:04:53.400 CEST +0200]
2017/08/04 00:04:52 ccin2p3-AtlasFrontier maxthreads=97 averagetime=40834.4 msec avedbquerytime=10337.7 msec threadsthreshold=337
2017/08/04 00:04:52 t0atlr maxthreads=0 averagetime=0 msec avedbquerytime=0 msec threadsthreshold=337
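These files are plain text, so values can be pulled out with standard tools; a minimal sketch printing instance name and maxthreads for today's CERN file (path taken from the example above):
grep -h 'maxthreads=' /home/dbfrontier/data/awstats/cernatlas/chkthread_atlasfrontier-ai-1/maxthreads.atlasfrontier-ai-1.$(date +%Y-%m-%d) | awk '{print $3, $4}'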
  • config file:
    • ~dbfrontier/apps/Monitor/Apps/maxthreads_monitor.config.atlas - on frontier server - sites displayed in the monitoring. Syntax:
      • servers - awstats frontier names
      • srcdir - directory where the awstats data will be stored
      • mailaddr - email address to receive alerts (atlas-frontier-support)
  • script file:
    • apps/Monitor/Apps/maxthreads_monitor.py
    • scripts/maxthreads/maxthreads_monitor.py
  • log file (of script running):
    • /home/dbfrontier/local/logs/maxthreads_monitor_new.log
  • www pages:
    • /home/dbfrontier/local/apache/frontier/maxthreads/
  • cronjob:
# Puppet Name: Frontier instance threads monitor
*/5 * * * * [ "`cl_status rscstatus 2>/dev/null`" == "all" ] || exit; cd /home/dbfrontier/apps/Monitor/Apps; (./maxthreads_monitor.sh cms;./maxthreads_monitor.sh atlas) >> /home/dbfrontier/local/logs/maxthreads_monitor.log 2>&1
  • i.e. the cron runs the bash script ./maxthreads_monitor.sh atlas every five minutes. The bash script sets paths, uses maxthreads_monitor.py to create html pages from the awstats data, and then creates the main monitoring webpage which uses them
  • there is also
# Puppet Name: New Frontier MaxThreads monitor implementation (RRD-based) (ATLAS)
*/5 * * * * [ "`cl_status rscstatus 2>/dev/null`" == "all" ] || exit; /home/dbfrontier/scripts/maxthreads/maxthreads_update.sh atlas >> /home/dbfrontier/local/logs/maxthreads_monitor_new.log 2>&1
  • this executes maxthreads_update.py (every 5 minutes), which sends some mail alerts and also writes the servlet status file used by maxthreads_monitor.py (also for some alerts); its log can be checked as shown below
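To verify that the cron actually ran, tail its log (path from the list above):
tail -n 20 /home/dbfrontier/local/logs/maxthreads_monitor_new.log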

Submission

  1. ~dbfrontier/apps/Monitor/Apps/maxthreads_monitor.config.atlas updated; the cron job then regenerated the page
  2. https://its.cern.ch/jira/browse/FTOPSDEVEL-122 updated

Kibana monitoring

# Puppet Name: Execute SLS probes
*/5 * * * * [ "`cl_status rscstatus 2>/dev/null`" == "all" ] || exit; /home/dbfrontier/scripts/slsfrontier/sls.sh &> /dev/null
  • sls.sh runs
    • /home/dbfrontier/scripts/slsfrontier/sls_frontier.pl atlas which checks the status of the frontiers and updates the xml files in /home/dbfrontier/etc/slsfrontier/
    • /home/dbfrontier/scripts/slsfrontier/sls_frontier.py atlas which is a replacement for sls_frontier.pl, i.e. it should do the same thing but be easier to debug
    • /home/dbfrontier/scripts/slsfrontier/kibanaMon.sh which reads the xml files and curls them to xsls.cern.ch, e.g.
curl -m 20 -sF file=@/home/dbfrontier/etc/slsfrontier/sls_atlas_frontier_Lpad-IN2P3-CC.xml xsls.cern.ch
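Before uploading, the xml can be checked for well-formedness; a sketch assuming xmllint is available on the machine:
xmllint --noout /home/dbfrontier/etc/slsfrontier/sls_atlas_frontier_Lpad-IN2P3-CC.xml && echo OK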

Submission

  1. ~dbfrontier/monit/slsfrontier/atlas_frontiers.txt updated; the cron job then regenerated the page
  2. https://its.cern.ch/jira/browse/FTOPSDEVEL-122 updated

Site squids

Every investigation can start by checking which cron jobs are running:
sudo -u squidmon crontab -l

Site squid monitoring

Access

From lxplus only
  • to master machine (to check crons, etc.)
ssh msvatos@wlcgsquidmon2.cern.ch
  • to backup machine (to test something in production, etc.)
ssh msvatos@wlcgsquidmon1.cern.ch
Monitoring scripts are located at ~squidmon/scripts.

Testing

Everything in ~squidmon on the master machine is copied to the backup machine every 5 minutes. To stop it (for example when I need to do some testing), I need to:
  • for short periods of testing:
    • log into the backup machine and rename ~squidmon/.ssh/authorized_keys. This will cause synchronization errors if left in place too long (hours).
  • for longer testing:
    • edit /home/squidmon/scripts/bin/identicalsquidmon and add the folders I do not want overwritten to the EXCLUDE line (same mechanism as sketched in the Testing section for frontier squids above)

Code development

There is a git repo.
  • checkout master
git clone https://:@gitlab.cern.ch:8443/frontier/wlcgsquidmon.git
  • checkout centos7 branch
git clone https://:@gitlab.cern.ch:8443/frontier/wlcgsquidmon.git
cd wlcgsquidmon/
git checkout -b centos7 origin/centos7
  • commit changes to existing files (in the wlcgsquidmon folder)
    1. git pull
    2. git add file
    3. git commit -m 'reason'
    4. git push

  • publish
~squidmon/bin/squidmon_publish
N.B. Sometimes it is necessary to set execute permissions on the script after it is published.

Change of wlcg-squid-monitor.cern.ch frontpage

  • in /home/squidmon/wwwsrc/index.html

WLCG mrtg monitoring

# Puppet Name: Generate squid information files
40 */3 * * * [ "`cl_status rscstatus 2>/dev/null`" == "all" ] || exit; /home/squidmon/scripts/make_squid_info.py >> /home/squidmon/logs/make_squid_info.log 2>&1
  • this script (wlcgsquidmon/scripts/make_squid_info.py) does the following (every 3 hours):
#!/usr/bin/python
# This script calls 4 other scripts when the cronjob runs.
# March 20, 2017 --Amjad
# Modified March 28, 2017

import sys
import os
import subprocess

subprocess.call('/home/squidmon/scripts/grid-services/get_squids.py >> /home/squidmon/logs/get_grid_squids.log 2>&1', shell=True)
subprocess.call('/home/squidmon/scripts/all/makeConfig.py >> /home/squidmon/logs/all/makeConfig.log 2>&1', shell=True)
subprocess.call('/home/squidmon/scripts/grid-services/cms/siteDB_map.py >> /home/squidmon/logs/siteDb_map.log 2>&1', shell=True)
subprocess.call('/home/squidmon/scripts/grid-services/make_worker_proxies.py >> /home/squidmon/logs/make_worker_proxies.log 2>&1', shell=True)
  • this does the following:
    • get_squids.py reads a list of exceptions (the file ~squidmon/conf/exceptions/monitoring.txt, where additional squids can be added and others removed), then gets the list of squids from GOCDB and OIM and writes it into ~squidmon/www/grid-squids.json. The JSON contains the name of each squid, the source (egi, osg), the sitename, and the IP (a hypothetical entry is sketched after this list)
    • makeConfig.py uses/updates the config files from /home/squidmon/etc/all/ and creates an empty monitoring webpage for each squid based on txt files with HTML templates
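A hypothetical sketch of one entry, based only on the fields listed above (the real key names and structure may differ; assumes jq is installed):
jq . ~squidmon/www/grid-squids.json | head
# a made-up entry could look like:
# {"name": "squid.example.org", "source": "egi", "sitename": "EXAMPLE-SITE", "ip": "192.0.2.10"}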

  • next cronjob:
# Puppet Name: Updating MRTG monitoring that uses GOCDB and OIM records
*/5 * * * * [ "`cl_status rscstatus 2>/dev/null`" == "all" ] || exit; /home/squidmon/scripts/all/mrtg_plots.py >> /home/squidmon/logs/all/mrtg_plots.log 2>&1
  • this does the following:
    • uses the mrtg program suite to make plots based on the config files from etc/all (a sketch of a manual run follows this list)
    • logs of the mrtg runs for each site are in /home/squidmon/logs/all/
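For debugging a single site, mrtg can be run by hand against one config file; a sketch (the config file name is an example; LANG=C works around mrtg's locale check):
env LANG=C mrtg /home/squidmon/etc/all/EXAMPLE-SITE.cfg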

obsoleted ATLAS mrtg monitoring

2 */6 * * * [ "`cl_status rscstatus 2>/dev/null`" == "all" ] || exit; /home/squidmon/scripts/cron/atlasMRTG.job >> /home/squidmon/logs/mrtg/mrtg.log 2>&1
  • this does the following:
    • /home/squidmon/scripts/cron/atlasMRTG.job consists of a few python scripts which take frontier/squid info from AGIS and a few bash scripts which make html pages based on that (basically the same functionality as make_squid_info.py but for ATLAS)
    • this is the script I need to recreate

current ATLAS mrtg monitoring

  • related JIRA tickets (assigned to me):
  • Uses the same cronjob as all page
  • scripts:
    • SquidList.py reads squid info from GOCDB/OIM and AGIS, matches sites+squids (from GOCDB/OIM) with sites+endpoints (from AGIS). Then it matches sites+squids (from GOCDB/OIM) with sites+nodes (from AGIS). Finally, it writes output into a JSON
    • PageBuilder.py reads list of squids from JSON and creates the webpage (at /home/squidmon/www/snmpstats/mrtgatlas2)
    • both scripts are located at /home/squidmon/scripts/mrtg/atlas2
    • exceptions are in /home/squidmon/conf/exceptions/mrtgatlas2exceptions.txt
    • scripts are run by /home/squidmon/scripts/make_squid_info.py
N.B. RRC-KI-T1 has a strict firewall policy. In case the IPs of the monitoring machines change, we need to let them know; they will open the firewall for them and the monitoring will work again.

SSB monitoring

# Puppet Name: MRTG monitoring (2)
*/25 * * * * [ "`cl_status rscstatus 2>/dev/null`" == "all" ] || exit; /home/squidmon/scripts/cron/atlasSSB.job >> /home/squidmon/logs/atlasSSB.log 2>&1
  • this does the following:
    • runs squidmon/scripts/mrtg/atlas/SquidState.py script - for Frontier_Squid
      • it reads the JSON files created by the MRTG monitoring to get site names and the number of squids
      • then it reads the proxy-hit.html webpages created by the MRTG monitoring to check whether each squid is OK or down: if the page contains the word time, it is marked OK; if not, the number of current requests in the weekly hits table is checked (each bin is a 30 min average) - if the unavailability is just a fluctuation, the average will be non-zero and the squid will still be marked OK; otherwise the squid is marked down (a manual check is sketched after this list)
      • creates the output file /home/squidmon/www/ssb/frontier-squid.data with statuses in a format SSB can (presumably) read
    • runs /home/squidmon/scripts/mrtg/atlas/SquidAvailability.py which creates http://wlcg-squid-monitor.cern.ch/snmpstats/SquidAvailability.html
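The word check described above can be reproduced by hand; a sketch (the URL path to a site's proxy-hit.html page is a guess):
curl -s http://wlcg-squid-monitor.cern.ch/snmpstats/mrtgatlas2/EXAMPLE-SITE/proxy-hit.html | grep -c time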

Submission

  1. the new SquidState.py committed to SVN and published (by running ~squidmon/bin/squidmon_publish on wlcgsquidmon2.cern.ch)
  2. https://its.cern.ch/jira/browse/FTOPSDEVEL-111 updated

Version monitoring

Monitoring of the squid package version as reported to MRTG. The scripts are not in cron; they are not intended to be run often. This is just for information (a manual run is sketched after the list below).
  • monitoring:
  • scripts
    • /home/squidmon/scripts/versmon/SquidVersionAll.py
    • /home/squidmon/scripts/versmon/SquidVersionATLAS.py
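They can be run by hand when the information is needed; a sketch, assuming they are directly runnable python scripts:
python /home/squidmon/scripts/versmon/SquidVersionATLAS.py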

Failover monitoring

# Puppet Name: Failover monitor
18 * * * * [ "`cl_status rscstatus 2>/dev/null`" == "all" ] || exit; /home/squidmon/scripts/failover-mon/cms/check-failover.sh >> /home/squidmon/logs/failover-mon-cms.log 2>&1

SAM tests

Code development

First, the official repo needs to be forked. This is done by clicking the Fork button at the top right, which creates my fork of the repo. Then the checkout:
  • checkout
git clone https://:@gitlab.cern.ch:8443/msvatos/atlas-sam.git
git checkout develop
  • commit changes to existing files (in the atlas-sam folder)
    1. git pull
    2. git add file
    3. git commit -m 'reason'
    4. git push
  • create a merge request
    • Open my fork of the repo, click on "Merge Requests" in the left sidebar, and then on "New Merge Request". Choose the develop branch of my fork as the source and the develop branch of the etf repo as the target, and compare. Then fill in the title and description and submit.

wpad

Setup for testing proxy autodiscovery (the proxyconfigurl entry points the frontier client at the wpad server):
export FRONTIER_SERVER="(serverurl=http://atlasfrontier-ai.cern.ch:8000/atlr)(serverurl=http://atlasfrontier2-ai.cern.ch:8000/atlr)(serverurl=http://atlasfrontier1-ai.cern.ch:8000/atlr)(serverurl=http://ccfrontier.in2p3.fr:23128/ccin2p3-AtlasFrontier)(serverurl=http://ccfrontier01.in2p3.fr:23128/ccin2p3-AtlasFrontier)(serverurl=http://ccfrontier05.in2p3.fr:23128/ccin2p3-AtlasFrontier)(proxyconfigurl=http://grid-wpad/wpad.dat)(failovertoserver=no)"
export FRONTIER_LOG_LEVEL=debug
  • workload in rel 19.2.4.9
export ATHENA_PROC_NUMBER=1
asetup 19.2.4.9,AtlasProduction,here
Sim_tf.py --inputEVNTFile="EVNT.10267546._000001.pool.root.1" --AMITag="s2698" --DBRelease="default:current" --DataRunNumber="222525" --conditionsTag "default:OFLCOND-RUN12-SDR-19" --firstEvent="4001" --geometryVersion="default:ATLAS-R2-2015-03-01-00_VALIDATION" --maxEvents="1" --outputHITSFile="HITS.10287255._000020.pool.root.1" --physicsList="FTFP_BERT" --postInclude "default:PyJobTransforms/UseFrontier.py" --preInclude "EVNTtoHITS:SimulationJobOptions/preInclude.BeamPipeKill.py,SimulationJobOptions/preInclude.FrozenShowersFCalOnly.py" --randomSeed="5" --runNumber="403247" --simulator="MC12G4" --skipEvents="4000" --truthStrategy="MC12LLP"
  • workload in rel 21.0.31
asetup --cmtconfig=x86_64-slc6-gcc62-opt Athena,21.0.31
export ATHENA_PROC_NUMBER=1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_88/pacparser/1.3.5/x86_64-slc6-gcc62-opt/lib
Sim_tf.py --inputEVNTFile="EVNT.15813151._000093.pool.root.1" --maxEvents="1" --postInclude "default:RecJobTransforms/UseFrontier.py" --preExec "EVNTtoHITS:simFlags.SimBarcodeOffset.set_Value_and_Lock(200000)" "EVNTtoHITS:simFlags.TRTRangeCut=30.0;simFlags.TightMuonStepping=True" --preInclude "EVNTtoHITS:SimulationJobOptions/preInclude.BeamPipeKill.py" --skipEvents="1000" --firstEvent="891001" --outputHITSFile="HITS.15813158._006172.pool.root.1" --physicsList="FTFP_BERT_ATL_VALIDATION" --randomSeed="892" --DBRelease="all:current" --conditionsTag "default:OFLCOND-MC16-SDR-14" --geometryVersion="default:ATLAS-R2-2016-01-00-01_VALIDATION" --runNumber="410996" --AMITag="a875" --DataRunNumber="284500" --simulator="ATLFASTII" --truthStrategy="MC15aPlus" 

Decommissioning of a monitoring

  1. stop execution - either ask for removal from cron or remove the particular script from the file that is run by the cron
  2. remove files from SVN - scripts from scripts, config files from conf and webpages from wwwsrc
  3. delete data folders
  4. delete html of the monitoring page
  5. if it is on the wlcg-squid-monitor.cern.ch, remove it from there

Tools

snmpwalk

Command that provides all information about the squid, e.g.
/usr/bin/snmpwalk -m ~squidmon/conf/mrtg/mib.txt -v2c -Cc -c public squid.farm.particle.cz:3401 squid
  • definition of variables https://wiki.squid-cache.org/Features/Snmp#Squid_OIDs
  • important variables:
    • cache_mem : specifies the ideal amount of memory to be used for In-Transit objects, Hot Objects, and Negative-Cached objects
      • cacheSysVMsize - Amount of cache_mem storage space used, in KB.
      • cacheMemMaxSize - The value of the cache_mem parameter in MB
    • cache_dir :
      • cacheSwapMaxSize - The total of the cache_dir space allocated in MB - should be 100000 - 200000 for non-small sites
      • cacheSysStorage - Amount of cache_dir storage space used, in KB.
    • cacheMemUsage - total memory accounted for, in KB
    • cacheCpuUsage - the percentage use of the CPU
    • cacheHttpErrors - if it is high, there is a problem (what exactly is hard to say from outside) - it needs a check of the squid logs (access.log and cache.log)
    • cacheVersionId - version of frontier-squid displayed on the MRTG monitoring page
    • cacheUptime - uptime displayed on the MRTG monitoring page
To get only one variable:
/usr/bin/snmpwalk -m ~squidmon/conf/mrtg/mib.txt -v2c -Cc -c public squid.farm.particle.cz:3401 cacheSwapMaxSize.0
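All the important variables listed above can be fetched in one loop (same command as above, one variable at a time):
for var in cacheVersionId cacheUptime cacheCpuUsage cacheMemUsage cacheMemMaxSize cacheSwapMaxSize cacheSysStorage cacheHttpErrors; do
  /usr/bin/snmpwalk -m ~squidmon/conf/mrtg/mib.txt -v2c -Cc -c public squid.farm.particle.cz:3401 $var
done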
Command that checks if the squid is working
snmpwalk -v2c -Cc -c public squidname:3401 .1.3.6.1.4.1.3495.1.1
  • if the snmpwalk gives timeouts, try traceroute squidname

nmap

To check if the squid has an open monitoring port:
nmap -p 3401 squidname

ATLAS Elastic Search

The Elastic Search instance hosted in Chicago indexes job logs and can provide further detail during an investigation.

How to search details in ATLAS Elastic Search

  1. Open the Elastic Search page (needs user account)
  2. Select "frontier_sql" index
  3. Click on "Add a filter"
  4. Choose "clientmachine" and "is"
  5. Put the IP address of the machine as the value.
  6. Save

Filtering

  • to filter a message
message:value
  • to filter out a message
NOT message:value
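Filters can be combined with the usual Lucene operators; a hypothetical combination (field and values are examples):
message:value1 AND NOT message:value2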

WLCG-WPAD dashboard

  • link
  • the hits come when something on a WLCG site (or on non-WLCG sites running WLCG jobs, e.g. LHC@Home jobs) tries to use proxy autodiscovery to get information about the nearest squid or
  • services monitored (each has one server in FNAL and one in CERN):
    • wlcg-wpad - replies positively only at grid sites, and includes backup proxies at those sites after squids that are found. I think the only production use is old-config LHC@Home jobs.
    • lhchomeproxy - like wlcg-wpad except at non-grid sites it replies DIRECT for openhtc.io destinations, so will use cloudflare. Used by current LHC@Home jobs.
    • cernvm-wpad - like lhchomeproxy except at non-grid sites it watches for too many requests in too short of a time (more below) and if so directs them to cernvm backup proxies on port 3125. Used as default for CernVM, cvmfsexec, and soon to be the default configuration for cvmfs if people do not set CVMFS_HTTP_PROXY and are using the cvmfs-config-default configuration rpm.
    • cmsopsquid - like cernvm-wpad except too many requests from non-grid sites in too short of a time get sent to the cms backup proxies on port 3128. Used by CMS opportunistic jobs in the U.S.
  • dashboard content
    • type of info
      • no squid found - wpad returned no squid
      • no org found - nothing found in the geoip database
      • default squid - wpad returned site's squid
      • disabled - if site's squid is disabled (recorded in worker-proxies.json and/or shoal-squids.json)
      • overload - if there are too many requests from one geoip org in too short of a time
    • service names
      • hits per service
      • type of info for each of the service names

Decoding a query

  • command which decodes a query from the encoded string in the squid log: ~dwd/adm/bin/decodesquidlog

Contacts

-- MichalSvatos - 2017-06-19

Topic attachments
  • SquidState.py.txt (2.6 K, 2017-08-11, MichalSvatos) - my heavily commented version of the SSB feeder script
  • maxthreads_monitor.py.txt (29.6 K, 2017-06-27, MichalSvatos) - my heavily commented version of the maxthreads script