AAA Operations Guide

Major goals

  • The project focuses on the development and deployment of storage resources that are accessible for any data, anywhere, at any time.
  • For CMS specifically, the goal is to enable easier, faster, and more efficient processing and analysis of data recorded at the LHC through more flexible use of computing resources.

Terminology

  • XRootD protocol: core software technology for federating storage
  • Federation: participating WLCG sites joining the AAA concept
  • Topology: hierarchical structure of federated data access across participating sites
  • Redirectors: topology checkpoints that subscribe upstream in the hierarchy of redirectors and redirect access requests to local data storage

Federated Storage Infrastructure

  • XRootD infrastructure that spans all of the Tier-1 and Tier-2 sites in EU and US CMS;
  • Each site’s xrootd server is interfaced with the local storage system, allowing it to export the CMS namespace;
  • site servers subscribe to a local redirector (recommended)
    • the local redirectors from each site then subscribe to a redundant regional redirector:
      • US: hosted by FNAL (DNS round-robin alias cmsxrootd.fnal.gov)
      • EU: Bari, Pisa, Paris (DNS round-robin alias xrootd-cms.infn.it)
      • Transitional: hosted by CERN (DNS round-robin alias cms-xrd-transit.cern.ch)
  • this forms a large tree-structured federated storage system;
    • a user can request a file from the regional redirector, which will then query the child nodes in the tree and redirect the user to a server that can serve the file.
    • the entire AAA infrastructure overlays on top of existing storage systems, allowing users to access any on-disk data without knowing its location. A simplified version of this process is shown in the attached figure, and a usage sketch follows this list.
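
As a usage sketch, the redirection chain is invisible to the user: the snippet below simply asks the US regional redirector for a file by its logical name (assuming the XRootD client tools and a valid grid proxy; the /store path is a placeholder, not a real file).

  # Minimal sketch of federated access: ask a regional redirector for a file
  # by its CMS logical file name (LFN) and let the federation locate a server
  # that actually holds it. Assumes the XRootD client tools and a valid grid
  # proxy; the LFN below is a placeholder, not a real file.
  import subprocess

  lfn = "/store/user/someuser/example.root"  # placeholder LFN
  redirector = "cmsxrootd.fnal.gov"          # US regional redirector

  # xrdcp contacts the redirector, which queries its subtree and redirects
  # the client to a site server that can actually serve the file.
  subprocess.run(["xrdcp", f"root://{redirector}/{lfn}", "local_copy.root"],
                 check=True)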

Monitoring

See details at XrootD Monitoring

Monitoring of Xrootd servers and data transfers

What is currently being monitored:

  1. Xrootd summary monitoring - tracking server status
  2. Xrootd detailed monitoring - tracking user sessions and file transfers
    • TCP GLED monitoring (see details below)
    • New xrootd collector (not in production yet)

Monitoring Components (System requirements)

Summary Monitoring
There are two dashboards monitoring AAA.
  • The first checks the service availability of the global and regional redirectors; see the Redirector Service Availability Grafana dashboard.
  • The second checks the XRootD status (version, role, etc.) of all XRootD servers and redirectors; see the AAA Subscriptions Grafana dashboard.
Detailed Monitoring
The new XRootD collector will be deployed to replace the TCP GLED XRootD collector described below.
  • XRootD servers report user logins, file opens, and read operations via binary UDP; the total rate for CMS is between 200 and 600 kB/s.
  • UDP-sucker-TCP-server: listens for UDP packets, tags them with receive time and source server, writes them into a ROOT tree (so the stream can be played back to debug the collector; up to 10 GB/day), and serves them to connected collectors over TCP. Typically one can run the production collector (the next service) plus a testing/live-monitoring instance on a desktop (as Matevz does). A minimal sketch of such a listener follows this list.
  • TCP GLED collector: aggregates the messages and builds an in-memory representation of everything that is going on in the federation. It does the following:
    • serves an HTML page of currently opened files (has to be certificate-protected via Apache);
    • sends a file-access report on file close to:
      • the AMQ broker at CERN
      • the Gratia collector at UNL (more below)
    • writes the reports into ROOT trees for detailed analysis (~a couple hundred MB/month).
  • UDP collector of "very detailed I/O information" (fully unpacked vector reads, for detailed analysis of read offsets and access patterns). All UCSD servers (~20) and one server from UNL also report in this format. For this, only ROOT trees are written out (about a GB per month).
  • Due to policy-violation concerns raised by IGTF, we run two collectors.
  • The collector configuration differs depending on the storage implementation, e.g. dCache vs. DPM sites.
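
As an illustration of the UDP-sucker idea above (not the production code; the port number is an assumption about what a server would set in its monitoring configuration), a minimal receiver could look like:

  # Minimal sketch of the "UDP-sucker": receive the binary UDP monitoring
  # packets that XRootD servers emit, tag each datagram with receive time and
  # source host, and decode the common 8-byte packet header. The port number
  # is an assumption; the production service also buffers the stream to ROOT
  # trees and re-serves it over TCP to the collectors.
  import socket
  import struct
  import time

  LISTEN_PORT = 9930  # assumed monitoring destination port

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  sock.bind(("", LISTEN_PORT))

  while True:
      data, (src_host, src_port) = sock.recvfrom(65536)
      recv_time = time.time()
      if len(data) < 8:
          continue  # too short to carry a monitoring header
      # XRootD monitoring header: code (1 B), sequence (1 B), packet length
      # (2 B), server start time (4 B), all in network byte order.
      code, pseq, plen, stod = struct.unpack("!BBHI", data[:8])
      print(f"{recv_time:.3f} {src_host}:{src_port} code={chr(code)} len={plen}")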

Instructions for Operators and Shifters

  • Redirector Service Availability Test and Monitoring:
    • We test 4 redirector DNS alias instances and 9 redirector host instances:
      • GLOBAL: cms-xrd-global.cern.ch
      • EU: xrootd-cms.infn.it
      • US: cmsxrootd.fnal.gov
      • Transitional Federation: cms-xrd-transit.cern.ch
    • The test runs as a cron job, uploadmetricGeneral.sh, on vocms036.
    • The cron script runs two Python scripts:
      • XRDFED-kibana-probe_JSON_General.py, which runs xrdfs <REGIONAL_REDIRECTOR | DNS_ALIASED_REDIRECTOR>:1094 query config <version | role> and xrdcp -d 2 -f -DIReadCacheSize 0 -DIRedirCntTimeout 180 root://<REGIONAL_REDIRECTOR | DNS_ALIASED_REDIRECTOR>/<SAM_test_file> /dev/null, and produces output for the InfluxDB data source monit_idb_cmsaaa (a stripped-down sketch of this probe follows this list)
      • send_metrics.py, which sends the produced output to that data source
    • All scripts can be found in the RedirectorServiceAvailability repository on GitHub.
    • In order to run XRDFED-kibana-probe_JSON_General.py, a CMS service certificate and a service proxy are needed (the proxy requirement depends on site policy and CMS user mapping). The proxy is created by a systemd service (TODO: which one?). When we transition to token authentication, this will need to be changed to token-based authentication.
    • The test results are monitored in the Redirector Service Availability Grafana dashboard.
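
For orientation, a stripped-down sketch of the probe logic follows. It assumes the xrdfs and xrdcp client tools and a valid service proxy; the SAM test file path is a placeholder, and the metric upload to InfluxDB is omitted.

  # Stripped-down sketch of the redirector availability probe: query each
  # redirector's version and role, then read a test file through it.
  # Assumes xrdfs/xrdcp are installed and a valid proxy exists; the SAM
  # test file path is a placeholder, and metric upload is omitted.
  import subprocess

  REDIRECTORS = [
      "cms-xrd-global.cern.ch",   # GLOBAL
      "xrootd-cms.infn.it",       # EU
      "cmsxrootd.fnal.gov",       # US
      "cms-xrd-transit.cern.ch",  # Transitional
  ]
  SAM_TEST_FILE = "/store/test/example_sam_file.root"  # placeholder path

  for host in REDIRECTORS:
      for what in ("version", "role"):
          q = subprocess.run(["xrdfs", f"{host}:1094", "query", "config", what],
                             capture_output=True, text=True)
          print(host, what, (q.stdout or q.stderr).strip())
      # Copy the test file through the redirector and discard the payload.
      cp = subprocess.run(["xrdcp", "-d", "2", "-f",
                           "-DIReadCacheSize", "0", "-DIRedirCntTimeout", "180",
                           f"root://{host}/{SAM_TEST_FILE}", "/dev/null"],
                          capture_output=True, text=True)
      print(host, "copy", "OK" if cp.returncode == 0 else f"FAILED rc={cp.returncode}")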

  • AAA Subscriptions Test and Monitoring:
    • Sites need to subscribe to their assigned regional redirector as described in Redirector Subscription.
    • There are two top-level (most upstream) redirectors: the global redirector for production sites and the transitional redirector for transitional sites.
    • The test runs as a cron job, probe_create_send_aaa_metrics.sh, on vocms036.
    • The cron script runs two Python scripts:
      • create_fedmaps.py, which executes the xrdmapc command against the global and transitional redirectors to get a list of all XRootD servers subscribing to them, and runs xrdfs to obtain the XRootD version and role information; it also reads the storage.json file to get the site backend storage information, and writes all collected information to a JSON file (a stripped-down sketch of this logic follows this list)
      • aaa_federation.py, which sends the JSON file content to the Elasticsearch DB
    • All scripts can be found in the FedProbeSendAAAMetrics repository on GitHub.
    • In order to run create_fedmaps.py, a CMS service certificate and a service proxy are needed (the proxy requirement depends on site policy and CMS user mapping). The proxy is created by a systemd service (TODO: which one?). When we transition to token authentication, this will need to be changed to token-based authentication.
    • The test results are monitored in the AAA Subscriptions Grafana dashboard.
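
A stripped-down sketch of that collection logic follows. The xrdmapc output parsing is a rough assumption (the exact cluster-map format may differ), and the storage.json merge and the Elasticsearch upload are omitted.

  # Stripped-down sketch of the subscription probe: ask a top-level
  # redirector for its cluster map (xrdmapc), then query each subscribed
  # server for version and role (xrdfs). The line parsing is a rough
  # assumption about xrdmapc's output; storage.json merging and the
  # Elasticsearch upload are omitted.
  import json
  import subprocess

  REDIRECTOR = "cms-xrd-global.cern.ch:1094"

  out = subprocess.run(["xrdmapc", "--list", "all", REDIRECTOR],
                       capture_output=True, text=True).stdout

  # Crude parse: keep the last whitespace-separated token of each line
  # when it looks like host:port (an assumption about the map format).
  servers = []
  for line in out.splitlines():
      parts = line.split()
      if parts and ":" in parts[-1]:
          servers.append(parts[-1])

  records = []
  for srv in servers:
      rec = {"server": srv}
      for what in ("version", "role"):
          q = subprocess.run(["xrdfs", srv, "query", "config", what],
                             capture_output=True, text=True)
          rec[what] = q.stdout.strip()
      records.append(rec)

  print(json.dumps(records, indent=2))  # production sends this JSON onward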

Managing xrootd machines

For machine access, Puppet configs, and crontab-related information, see CompOpsCentralServicesXrootd.

Scale tests

Operations and troubleshooting

The troubleshooting guide can be found at https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsAAATroubleshootingGuide

AAA support groups

How to examine badly behaving sites in AAA

How to learn storage backend and version

  • From the AAA point of view it is important to know each site's storage technology (dCache, DPM, etc.) and its version. If a site is using a very old version of a storage system, we need to open a GGUS ticket and initiate a system update.
  • In the past, we would go to the AAAOps GitHub page, consult README.md, and run the script there on lxplus to check the storage backend. This is no longer available.
  • At the moment, the best option for checking the storage backend is the information hard-coded in the storage.json file, which can be found at SITECONF/storage.json for each site; a minimal reading sketch follows.
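
As a minimal sketch (the field names "site", "volume", and "protocols" are assumptions about the storage.json schema and may need adjusting):

  # Minimal sketch of extracting backend-storage information from a site's
  # SITECONF storage.json. The schema fields used here ("site", "volume",
  # "protocols") are assumptions and may differ from the actual file.
  import json

  with open("storage.json") as f:
      entries = json.load(f)

  for entry in entries:
      site = entry.get("site", "?")
      volume = entry.get("volume", "?")
      protocols = [p.get("protocol", "?") for p in entry.get("protocols", [])]
      print(f"{site} {volume}: protocols={protocols}")
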
Topic attachments
  • Regional_Xrootd.png (PNG, 51.7 K, uploaded 2015-02-24 by MericTaze)