Introduction

The purpose of the group is to make sure that site squids operate smoothly and to help squid administrators fix problems with their squids. This includes dealing with failovers to ATLAS, CMS, and CVMFS servers.

Members

  • Barry Jay Blumenfeld (CMS)
  • Edita Kizinevic (CMS)
  • Michal Svatos (ATLAS)

Contact

Failover monitoring

Description

The following text describes (as of 20.6.2019) how the various input files are processed to obtain failovers and how the monitoring web page displays them.

Script

Summary of input files:
  • config file
  • list of squids and their properties (grid-squids.json)
  • name translation file
  • records of failover activity in the last 72 hours (failover-record.tsv)
  • awstats files
  • records of awstats activity for each machine group (the state from the last time the script ran)
  • records of emails sent in the last 72 hours (email-record.tsv)

Summary of output files:

  • records of emails sent in the last 72 hours (with addition of currently sent emails)
  • records of failover activity in the last 72 hours (with addition of current failovers)
  • records of failover activity in the last 72 hours (with addition of current failovers) with squids removed
    • This file is used by the monitoring webpage
  • summary of daily activity per site (site_summary.tsv)

The procedure is as follows:

  1. From the list of squids, the script extracts the host name, site name, IPs and IP ranges.
  2. It loads the last 72 hours of failover history.
  3. For each Group (a group of machines as defined by awstats, e.g. cmscernbp, atlastriumf, etc.):
    1. It reads the following from the 72-hour history: Timestamp, Group, Sites, Host, IP, IsSquid, Bandwidth, BandwidthRate, Hits, HitsRate (the meaning of these values is explained later).
    2. It reads the list of machines in the Group (from the config file) and checks the awstats files for each machine, extracting host name, number of hits, bandwidth, and IP (obtained from the host name by socket.getaddrinfo).
      • If corrupted awstats records (with many "l" characters added to the beginning of the host name) are found, they are cleaned: if the host name contains more than two "l" characters, they are counted and that number of characters is removed from the beginning of the host name.
    3. It reads the awstats activity history file (data from the last time the script ran) and writes the current state into that file.
    4. Data from the current execution of the script (i.e. data from the awstats folders) are compared with data from the previous execution (i.e. data from the awstats history file which the script reads and writes). A failover activity record is created when the host name and IP match in the current and previous data and hits and bandwidth have increased with respect to the previous data, and also for host names in the current data which are not in the previous data (i.e. when new activity appears). Each record contains:
      • Group - group of machines as defined by awstats, e.g. cmscernbp, atlastriumf, etc.
      • Sites
        • if the IP is not found, the site becomes "Unknown"
        • if the IP is found, the site name is obtained by passing the IP to geoip.org_by_addr
        • if geoip.org_by_addr cannot find a site name for the IP, the site becomes "Unknown"
      • Host
        • if the host name is any of '127.0.0.1', 'localhost', 'localhost6', '::1', it is set to "localhost"; otherwise it is set to the host name from the current data
      • IP
      • IsSquid
        • if the IP is found in the list of squids, then it is set to True
      • Bandwidth
        • bandwidth from previous is subtracted from current
      • BandwidthRate
        • bandwidth from previous is subtracted from current and then it is divided by difference in timestamps
      • Hits
        • hits from previous are subtracted from current
      • HitsRate
        • hits from previous are subtracted from current and then it is divided by difference in timestamps
    5. Those are active failovers. For them, the following is done:
      1. Any host name in the Unknown site which ends with cern.ch is moved to CERN-EuropeanOrganizationforNuclearResearch.
      2. Site names are translated from geoip names to experiment-specific names. This is done by a separate script outside of the main script; the main script just reads its output. The separate script works as follows:
        1. The list of squids is read and, for each squid, the pair of site name and geoip site name is kept.
        2. For CMS, site names are matched with lcg names from cms-lcg-sites.json. If a match is found, the pair of cms name and geoip name is kept.
        3. For ATLAS, site names are matched with "name" names from agis_site_info.json. If a match is found, the pair of "rc_site" name and geoip name is kept.
      3. If a site name consists of several sites, those site names are looked up in the list of squids. For those which are found, IP ranges are extracted. Then it is checked whether the IP in the failover activity record belongs to any of the IP ranges. If a match is found, the site name is set to the matched site.
        • the range 0.0.0.0/0 is ignored because every IP belongs to it
      4. For each site, the total hit rate coming from it is calculated as the sum of the hit rates of all host names belonging to it. If the hit rate exceeds a threshold (defined in the config file), the site is considered an offending site and is added to the records of failover activity with the current timestamp.
    6. If a site appears in the two last consecutive records of failover activity, it is considered a persistent failover. Two last consecutive records means:
      • There is one record with a timestamp less than 400 s old.
      • There is one record with a timestamp more than 3000 s but less than 4000 s old.
    7. For persistent failovers, a summary per site is made. For each site, it provides the sum of hits per machine group in the last hour and in the last 24 hours.
  4. The record of emails sent in the last 72 hours is read, and emails sent in the last 24 hours are extracted. For persistent failover sites which are not on the list of emails from the last 24 hours, emails (based on a pre-prepared template) are sent to addresses from the config file. An example of such an email:
    This is an automated message created at 2019-06-20 08:18:58 (UTC).  Many database queries from your site have connected
    directly to the following Frontier Server Groups during the last hour, with a high
    rate of queries not going through your local squid(s):
    
    Group           GCode           RateThreshold [*]
    FNAL Stratum 1  cvmfs.fnal.gov        5.56
    
    
    The most common sources of this problem are:
        1. Squids are not running
        2. Squids are not listed in site-local-config.xml
        3. Not all local IP addresses are in squid's access control lists
    
    When you have found the cause of the problem or if you have any questions, please
    contact wlcg-squid-ops@cern.ch by replying to this message.
    
    The record of Frontier activity from your site during the last period (60 minutes)
    is displayed below. The full access history during the past 72 hours is available
    at http://wlcg-squid-monitor.cern.ch/failover/failoverCvmfs/failover.html
    
    [*] The rate is the effective number of queries per second over each period.
    
    ===== Record of Frontier server accesses in the last hour =====
    Site: Unknown
    Host aggregation:
    IsSquid  GCode           Hits   Bandwidth
    False    cvmfs.fnal.gov  35453  8.78 GiB
    
    
    Detailed table:
    IsSquid  GCode           Host                                                  Ip              Hits   Bandwidth
    False    cvmfs.fnal.gov  localhost.localdomain                                 127.0.0.1       30208   7.80 GiB
    False    cvmfs.fnal.gov  bb118-200-69-89.singnet.com.sg                        NoIpFound        112   45.72 MiB
    True     cvmfs.fnal.gov  grinr06.inr.troitsk.ru                                185.207.88.136    2     1.73 kiB
    False    cvmfs.fnal.gov  no-mans-land.m247.com                                 NoIpFound        12    10.86 kiB
    False    cvmfs.fnal.gov  66.37.33.202.lightower.net                            NoIpFound         5     4.37 kiB
    False    cvmfs.fnal.gov  174-30-26-155.wrbg.centurylink.net                    NoIpFound       1305   214.66 MiB
    False    cvmfs.fnal.gov  customer15-179.lawireless.it                          NoIpFound         3     2.15 kiB
    False    cvmfs.fnal.gov  fsquid.swt2.atlas-swt2.org.uta.edu                    NoIpFound        30    25.78 kiB
    False    cvmfs.fnal.gov  dynamic-216-186-138-162.knology.net                   NoIpFound        49    148.32 kiB
    False    cvmfs.fnal.gov  dynamic-216-186-150-154.knology.net                   NoIpFound         5     4.53 kiB
    False    cvmfs.fnal.gov  storagelabinfo1.lncc.br                               NoIpFound         2     0.63 kiB
    False    cvmfs.fnal.gov  96-78-73-89-static.hfc.comcastbusiness.net            NoIpFound         5     4.37 kiB
    False    cvmfs.fnal.gov  86-126-27-130.rdsnet.ro                               NoIpFound        149   39.28 MiB
    False    cvmfs.fnal.gov  ads-pool.nat.uw.edu                                   NoIpFound        15    13.65 kiB
    False    cvmfs.fnal.gov  209-6-203-217.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com  NoIpFound         5     4.37 kiB
    False    cvmfs.fnal.gov  107-145-170-127.res.bhn.net                           NoIpFound         4     3.49 kiB
    False    cvmfs.fnal.gov  74.82.240.18.ifibertv.com                             NoIpFound         8     7.27 kiB
    False    cvmfs.fnal.gov  46-140-17-230.static.cablecom.ch                      NoIpFound        44    19.47 MiB
    False    cvmfs.fnal.gov  130.20.68.60.pnnl.gov                                 NoIpFound        30    26.67 kiB
    False    cvmfs.fnal.gov  epldt085.ph.bham.ac.uk                                NoIpFound         1     325.00 B
    False    cvmfs.fnal.gov  cpe-121-208-32-42.cfui-cr-004.woo.qld.bigpond.net.au  NoIpFound       2716   461.39 MiB
    False    cvmfs.fnal.gov  host184537317.direcway.com                            NoIpFound        46     2.64 MiB
    False    cvmfs.fnal.gov  wlg-nat.solnetsolutions.co.nz                         NoIpFound         3     3.35 kiB
    False    cvmfs.fnal.gov  adsl-ull-50-53.49-151.wind.it                         NoIpFound         2     1.82 kiB
    False    cvmfs.fnal.gov  ip52-181.comteam.at                                   NoIpFound         1     325.00 B
    False    cvmfs.fnal.gov  home-78-83-130-166.optinet.bg                         NoIpFound         2     1.82 kiB
    False    cvmfs.fnal.gov  cust-210.241.102.5.018.net.il                         NoIpFound        633   178.11 MiB
    False    cvmfs.fnal.gov  pool-6-193.106.50.119.o.kg                            NoIpFound        11     2.25 MiB
    False    cvmfs.fnal.gov  188-26-142-130.rdsnet.ro                              NoIpFound         2     0.63 kiB
    False    cvmfs.fnal.gov  pop-139.126.escom.bg                                  NoIpFound        11    19.25 MiB
    False    cvmfs.fnal.gov  cable-37-120-81-23.cust.telecolumbus.net              NoIpFound        32    22.57 MiB
    
  5. Output files are written.
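
The core of step 3 can be illustrated with a short Python sketch. It is not the script's actual code: the function names, the snapshot layout {(host, ip): (hits, bandwidth)} and the dictionaries passed in are assumptions made for illustration. It shows how records could be built from the previous and current awstats data, how an ambiguous site could be resolved through IP ranges (ignoring 0.0.0.0/0), and how offending sites could be selected by summed hit rate.

```python
import ipaddress
from collections import defaultdict

LOCAL_NAMES = {"127.0.0.1", "localhost", "localhost6", "::1"}


def build_records(previous, current, elapsed_s):
    """Create failover activity records from two awstats snapshots.

    Assumed (hypothetical) layout: {(host, ip): (hits, bandwidth_bytes)}.
    A (host, ip) pair missing from `previous` is treated as new activity.
    """
    records = []
    for (host, ip), (hits, bandwidth) in current.items():
        prev_hits, prev_bandwidth = previous.get((host, ip), (0, 0))
        d_hits = hits - prev_hits
        d_bandwidth = bandwidth - prev_bandwidth
        if d_hits <= 0 or d_bandwidth <= 0:
            continue  # no increase since the previous run
        records.append({
            "Host": "localhost" if host in LOCAL_NAMES else host,
            "IP": ip,
            "Hits": d_hits,
            "HitsRate": d_hits / elapsed_s,
            "Bandwidth": d_bandwidth,
            "BandwidthRate": d_bandwidth / elapsed_s,
        })
    return records


def resolve_site(ip, ranges_by_site):
    """Return the site whose IP ranges contain `ip`, or None if none does."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return None  # not a parsable address, e.g. "NoIpFound"
    for site, ranges in ranges_by_site.items():
        for cidr in ranges:
            if cidr == "0.0.0.0/0":
                continue  # matches every IPv4 address, so it is ignored
            if address in ipaddress.ip_network(cidr, strict=False):
                return site
    return None


def offending_sites(records, site_of_ip, threshold):
    """Sum HitsRate per site and keep sites above the threshold."""
    totals = defaultdict(float)
    for record in records:
        site = site_of_ip.get(record["IP"], "Unknown")
        totals[site] += record["HitsRate"]
    return {site: rate for site, rate in totals.items() if rate > threshold}
```

In this sketch site_of_ip stands in for the geoip lookup and name translation, and the threshold would come from the config file.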

Web page

The webpage combines HTML with heavy usage of JavaScript. It contains:
  • some introductory info
  • information about the time range currently being displayed
  • Machine Groups pie chart with legend
    • Clicking on a machine group, either in the pie chart or in the legend, filters the "Hits by site per hour" plot and the "Access History Detail" table.
    • The blue word "Reset" next to the "Machine Groups" text resets the filtering.
  • "Hits by site per hour" plot, which displays the number of hits from failover sites in one-hour bins, with a legend
    • hovering over a site name in the legend hides hits from other sites
    • hovering over a bar on the plot displays the site name and the number of hits in the bar
  • "Access History Detail" table, which displays, for each site, all its WNs (worker nodes) with timestamp, number of hits and bandwidth
    • hovering over a host name displays its IP
    • hovering over the number of hits displays the hits rate
    • hovering over the bandwidth displays the bandwidth rate
  • "Record of Email Alarms" table, which shows sites which raised an email alarm in the last 72 hours. It shows the site name, who got the email and when.

Monitoring pages

Code

Email alerts

The failover monitor sends an email alert when the number of hits from a site exceeds the threshold defined in the config twice in a row. The recipients of these emails are:
  • ATLAS: Michal Svatos
  • CMS: cms-frontier-alarms
  • CVMFS: wlcg-squid-ops
  • CernVM: Dave Dykstra
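
The "twice in a row" condition corresponds to the persistent failover criteria given in the Script section (one record less than 400 s old and one record between 3000 s and 4000 s old). A minimal Python sketch of that check follows; the function name and the representation of records as a list of Unix timestamps are assumptions made for illustration, not the script's actual code.

```python
def is_persistent(record_times, now, recent_s=400, older_min_s=3000, older_max_s=4000):
    """Return True if a site's failover records satisfy both time windows.

    record_times: timestamps (seconds) of the site's failover activity records.
    now: current time in the same units.
    """
    ages = [now - t for t in record_times]
    has_recent = any(0 <= age < recent_s for age in ages)
    has_older = any(older_min_s < age < older_max_s for age in ages)
    return has_recent and has_older
```

A site with only a recent record, or only an older one, would not trigger an alert under this check.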

ATLAS Elastic Search

The Elastic Search instance hosted in Chicago provides job log details which can help with further investigation.

How to search details in ATLAS Elastic Search

  1. Open the Elastic Search page (requires a user account).
  2. Select the "frontier_sql" index.
  3. Click on "Add a filter".
  4. Choose "clientmachine" and "is".
  5. Enter the IP address of the machine as the value.
  6. Save.

Filtering

  • to filter a message
message:value
  • to filter out a message
NOT message:value

-- MichalSvatos - 2019-05-09

Topic revision: r8 - 2019-08-26 - MichalSvatos
 