The purpose of the group is to make sure that site squids are operating smoothly and to assist squid administrators in fixing problems with their squids. This includes dealing with failovers to ATLAS, CMS, and CVMFS servers.


  • Barry Jay Blumenfeld (CMS)
  • Edita Kizinevic (CMS)
  • Michal Svatos (ATLAS)


Failover monitoring


The following text describes (as of 10.12.2019) how various input files are processed to detect failovers and how the monitoring web page displays them.

Summary of input files

  • config file
  • list of squids and their properties (grid-squids.json)
  • name translation file
  • records of failover activity in the last 72 hours (failover-record.tsv)
  • awstats files
  • records of awstats activity for each machine group (situation from the last time the script was running)
  • records of emails sent in the last 72 hours (email-record.tsv)

Summary of output files

  • records of emails sent in the last 72 hours (with addition of currently sent emails)
  • records of failover activity in the last 72 hours (with addition of current failovers)
  • records of failover activity in the last 72 hours (with addition of current failovers) with squids removed
  • summary of daily activity per site
  • high bandwidth logs

The procedure

  1. From the list of squids, it extracts host names, site names, IPs, and IP ranges.
  2. It loads the last 72 hours of failover history.
  3. For each Group (a group of machines as defined by awstats, e.g. cmscernbp, atlastriumf, etc.):
    1. It reads the following from the 72-hour history: Timestamp, Group, Sites, Host, IP, IsSquid, Bandwidth, BandwidthRate, Hits, HitsRate (the meaning of these values is explained later).
    2. It reads the list of machines in the Group (from a config file) and checks the awstats files for each machine, extracting the host name, number of hits, bandwidth, and IP (obtained from the host name by socket.getaddrinfo).
      • if corrupted awstats records (with many 'l' characters prepended to the host name) are found, they are cleaned: if the host name contains more than two 'l' characters, they are counted and that number of characters is removed from the beginning of the host name
    3. It reads the history-of-awstats-activity file (data from the last time the script ran) and writes the current situation into that file.
    4. From the awstats activity file, hosts using a lot of bandwidth on the current day are extracted. The output files are grouped as follows:
      • by IsSquid - non-squids first, squids after them
      • by bandwidth - within each group (non-squid or squid), the hosts are sorted by bandwidth in descending order
    5. Data from the current execution of the script (i.e. data from the awstats folders) is compared with data from the previous execution (i.e. data from the awstats history file the script reads and writes). A failover activity record is created when the host name and IP match between current and previous (and hits and bandwidth have increased in current with respect to previous), and also for host names in current which are not in previous (i.e. when new activity appears). Each record contains:
      • Group - group of machines as defined by awstats, e.g. cmscernbp, atlastriumf, etc.
      • Sites
        • if the IP is not found, the site becomes "Unknown"
        • if the IP is found, the site name is obtained from geoip.org_by_addr, which processes the IP
        • if geoip.org_by_addr cannot find a site name for the IP, the site becomes "Unknown"
      • Host
        • if the host name is any of '', 'localhost', 'localhost6', '::1', it is set to "localhost"; otherwise it is set to the host name from the current data
      • IP
      • IsSquid
        • if the IP is found in the list of squids, it is set to True
      • Bandwidth
        • the bandwidth from the previous execution is subtracted from the current one
      • BandwidthRate
        • the bandwidth from the previous execution is subtracted from the current one, and the result is divided by the difference in timestamps
      • Hits
        • the hits from the previous execution are subtracted from the current ones
      • HitsRate
        • the hits from the previous execution are subtracted from the current ones, and the result is divided by the difference in timestamps
    6. Those are active failovers. For them, the following is done:
      1. Any host name in the Unknown site which ends with a certain suffix is moved to CERN-EuropeanOrganizationforNuclearResearch
      2. Site names are translated from geoip names to experiment-specific names. This is done by another script outside the main script; the main script just reads its output. Its functionality is:
        1. The list of squids is read and, for each squid, the pair of site name and geoip site name is kept.
        2. For CMS, site names are matched with lcg names from cms-lcg-sites.json. If a match is found, the pair of cms name and geoip name is kept.
        3. For ATLAS, site names are matched with "name" values from agis_site_info.json. If a match is found, the pair of "rc_site" name and geoip name is kept.
      3. If the site name consists of several sites, the site names are searched in the list of squids. For those which are found, IP ranges are extracted. It is then checked whether the IP in the failover activity record belongs to one of the IP ranges. If a match is found, the site name is set to the matched site.
        • one range is ignored, as it seems every IP belongs to it
      4. Exceptions defined in the config file (strings separated by semicolons) are dealt with.
      5. For each site, the total hit rate coming from it is calculated as the sum of the hit rates from each host name belonging to it. If the hit rate exceeds a threshold (defined in the config file), the site is considered an offending site and is added into the records of failover activity with the current timestamp.
    7. If a site has two last consecutive records of failover activity, it is considered a persistent failover. Two last consecutive records means:
      • there is one record whose timestamp is less than 400 s from now
      • there is one record whose timestamp is more than 3000 s but less than 4000 s from now
    8. For persistent failovers, a summary per site is made. For each site, it provides, per machine group, the sum of hits in the last hour and in the last 24 hours.
  4. The record of emails sent in the last 72 hours is read, and the emails sent in the last 24 hours are extracted. For persistent failover sites which are not on the list of emails from the last 24 hours, emails (based on a pre-prepared template) are sent to addresses from the config file. An example:
    This is an automated message created at 2019-06-20 08:18:58 (UTC). Many database queries from your site have connected
    directly to the following Frontier Server Groups during the last hour, with a high
    rate of queries not going through your local squid(s):
    Group           GCode           RateThreshold [*]
    FNAL Stratum 1                  5.56
    The most common sources of this problem are:
        1. Squids are not running
        2. Squids are not listed in site-local-config.xml
        3. Not all local IP addresses are in the squid's access control lists
    When you have found the cause of the problem, or if you have any questions, please
    contact us by replying to this message.
    The record of Frontier activity from your site during the last period (60 minutes)
    is displayed below. The full access history during the past 72 hours is available.
    [*] The rate is the effective number of queries per second over each period.
    ===== Record of Frontier server accesses in the last hour =====
    Site: Unknown
    Host aggregation:
    IsSquid  GCode           Hits    Bandwidth
    False                    35453   8.78 GiB
    Detailed table (host names and most IPs were elided in this copy):
    IsSquid  GCode           Host                     Ip          Hits    Bandwidth
    False                    localhost.localdomain                30208   7.80 GiB
    False                                             NoIpFound   2716    461.39 MiB
    False                                             NoIpFound   1305    214.66 MiB
    False                                             NoIpFound   633     178.11 MiB
    True                                                          2       1.73 kiB
    ...
  5. Output files are written.

Web page

The web page combines HTML with heavy usage of JavaScript. It contains:
  • some introductory info
  • information about the time range currently being displayed
  • "Machine Groups" pie chart with legend
    • Clicking on a machine group, either in the pie chart or in the legend, filters the "Hits by site per hour" plot and the "Access History Detail" table.
    • The blue word "Reset" next to the "Machine Groups" text resets the filtering.
  • "Hits by site per hour" plot displays the number of hits from failover sites in one-hour bins, with a legend
    • hovering over a site name in the legend hides hits from other sites
    • hovering over a bar in the plot displays the site name and the number of hits in the bar
  • "Access History Detail" table displays, for each site, all its WNs, timestamps, numbers of hits, and bandwidth
    • hovering over a host name displays its IP
    • hovering over a number of hits displays the hit rate
    • hovering over a bandwidth displays the bandwidth rate
  • "Record of Email Alarms" table shows sites which raised an email alarm in the last 72 hours. It shows the site name, who got the email, and when.

Monitoring pages


Email alerts

The failover monitor sends an email alert when the number of hits from a site exceeds the threshold defined in the config twice in a row. People who get these emails are:
  • ATLAS: Michal Svatos
  • CMS: cms-frontier-alarms
  • CVMFS: wlcg-squid-ops
  • CernVM: Dave Dykstra

ATLAS Elastic Search

The Elastic Search instance hosted in Chicago indexes job logs, which can provide further details during an investigation.

How to search details in ATLAS Elastic Search

  1. Open the Elastic Search page (needs user account)
  2. Select "frontier_sql" index
  3. Click on "Add a filter"
  4. Choose "clientmachine" and "is"
  5. Put IP address of the machine as the value.
  6. Save


  • to filter on a message: message:value
  • to filter out a message: NOT message:value
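The same filter can be built programmatically. A minimal sketch using the Elasticsearch query DSL; the index and field names come from the steps above, while the IP value is a placeholder:

```python
# Term query equivalent to choosing "clientmachine" "is" <IP> in Kibana.
# 192.0.2.10 is a documentation-range placeholder, not a real worker node.
query = {
    "query": {
        "term": {"clientmachine": "192.0.2.10"}
    }
}

# The Kibana search-bar equivalents of the two bullet points above:
filter_in = "clientmachine:192.0.2.10"       # keep matching documents
filter_out = "NOT clientmachine:192.0.2.10"  # drop matching documents
```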

WLCG-WPAD dashboard

  • link
  • the hits come when something on a WLCG site (or on non-WLCG sites running WLCG jobs, e.g. LHC@Home jobs) tries to use proxy autodiscovery to get information about the nearest squid
  • services monitored (each has one server in FNAL and one in CERN):
    • wlcg-wpad - replies positively only at grid sites, and includes backup proxies at those sites after squids that are found. I think the only production use is old-config LHC@Home jobs.
    • lhchomeproxy - like wlcg-wpad except at non-grid sites it replies DIRECT for destinations, so it will use Cloudflare. Used by current LHC@Home jobs.
    • cernvm-wpad - like lhchomeproxy except at non-grid sites it watches for too many requests in too short of a time (more below) and if so directs them to cernvm backup proxies on port 3125. Used as default for CernVM, cvmfsexec, and soon to be the default configuration for cvmfs if people do not set CVMFS_HTTP_PROXY and are using the cvmfs-config-default configuration rpm.
    • cmsopsquid - like cernvm-wpad except too many requests from non-grid sites in too short of a time get sent to the cms backup proxies on port 3128. Used by CMS opportunistic jobs in the U.S.
  • dashboard content
    • type of info
      • no squid found - wpad returned no squid
      • no org found - nothing found in the geoip database
      • default squid - wpad returned site's squid
      • disabled - if site's squid is disabled (recorded in worker-proxies.json and/or shoal-squids.json)
      • overload - if there are too many requests from one geoip org in too short of a time
    • service names
      • hits per service
      • type of info for each of the service names
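What a WPAD service serves is a proxy auto-config (PAC) file, i.e. a JavaScript function that maps a destination to a proxy chain. A hypothetical minimal example of the kind of reply described above (the squid and backup-proxy host names are illustrative, not what wlcg-wpad actually returns):

```javascript
// Hypothetical PAC file of the kind a WPAD service returns.
// The site squid and backup proxy host names are illustrative only.
function FindProxyForURL(url, host) {
    if (host === "cvmfs-stratum-one.cern.ch") {
        // site squid first, then a backup proxy, then direct
        return "PROXY squid.example-site.org:3128; " +
               "PROXY backup-proxy.example.org:3125; DIRECT";
    }
    // non-grid destination: connect directly
    return "DIRECT";
}
```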

CVMFS notes

  • cvmfs_config showconfig - shows the CVMFS configuration details, including the proxy settings
  • cvmfs_talk proxy info - shows the proxies currently in use

-- MichalSvatos - 2019-05-09

Topic revision: r19 - 2020-09-22 - MichalSvatos