Operations Guide of AAA
Major goals
- the project focuses on the development and deployment of storage resources that make any data accessible anywhere, at any time.
- for CMS specifically, the goal is to enable easier, faster and more efficient processing and analysis of data recorded at the LHC through more flexible use of computing resources.
Terminology
- XRootD protocol: core SW technology for federating storage
- Federation: participating WLCG sites joining AAA concept
- Topology: hierarchical structure of the federated data access across participating sites
- Redirectors: check-points in the topology that subscribe upstream in the hierarchy of redirectors and redirect access requests to local data storage
Federated Storage Infrastructure
- XRootD infrastructure that spans all of the Tier-1 and Tier-2 sites in EU and US CMS;
- Each site’s xrootd server is interfaced with the local storage system, allowing it to export the CMS namespace;
- site servers subscribe to a local redirector (recommended)
- the local redirectors from each site then subscribe to a redundant regional redirector:
- US: hosted by FNAL (DNS round-robin alias cmsxrootd.fnal.gov)
- EU: Bari, Pisa, Paris (DNS round-robin alias xrootd-cms.infn.it)
- Transitional: hosted by CERN (DNS round-robin alias cms-xrd-transit.cern.ch)
- this forms a large tree-structured federated storage system;
- a user can request a file from the regional redirector, which will then query the child nodes in the tree and redirect the user to a server that can serve the file.
- the entire AAA infrastructure is overlaid on top of existing storage systems, allowing users to access any on-disk data without knowing its location. A simplified version of this process is shown in the attached figure.
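The lookup described above can be illustrated with a toy model. This is a minimal sketch, not the XRootD implementation: the `Node` class, the node names, and the file names are hypothetical, and a real redirector queries its subscribers over the network rather than walking an in-memory tree. The `xrootd_url` helper only shows the usual `root://host//path` form of a URL that points a client at a redirector.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A redirector (has children) or a data server (has files). Hypothetical model."""
    name: str
    files: set = field(default_factory=set)       # files served directly by this node
    children: list = field(default_factory=list)  # nodes subscribed to this node

    def locate(self, lfn: str) -> Optional[str]:
        """Return the name of a server holding lfn, searching the subtree depth-first."""
        if lfn in self.files:
            return self.name
        for child in self.children:
            found = child.locate(lfn)
            if found is not None:
                return found
        return None


def xrootd_url(host: str, lfn: str) -> str:
    """Build a root:// URL; lfn starts with '/', which yields the double slash."""
    return f"root://{host}/{lfn}"


# Two site redirectors, each with one data server, below one regional redirector.
site_a = Node("site-a-redirector",
              children=[Node("server-a1", files={"/store/data/example.root"})])
site_b = Node("site-b-redirector",
              children=[Node("server-b1", files={"/store/mc/other.root"})])
regional = Node("regional-redirector", children=[site_a, site_b])

print(regional.locate("/store/mc/other.root"))   # server-b1
print(regional.locate("/store/missing.root"))    # None
print(xrootd_url("cmsxrootd.fnal.gov", "/store/data/example.root"))
```

The user only ever names the regional redirector; which server finally delivers the file is decided by the tree.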
Monitoring
See details at XrootD Monitoring
Monitoring of Xrootd servers and data transfers
What is currently being monitored:
- Xrootd summary monitoring - tracking server status
- Xrootd detailed monitoring - tracking user sessions and file transfers
- TCP Gled Monitoring (See details below)
- New xrootd collector (not in production yet)
Monitoring Components (System requirements)
Summary Monitoring
Two dashboards monitor AAA:
- The first checks the service availability of the global and regional redirectors (Grafana dashboard: Redirector Service Availability).
- The second checks the xrootd status (version, role, etc.) of all xrootd servers and redirectors (Grafana dashboard: AAA Subscriptions).
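The dashboards remain the reference for availability. For a quick manual cross-check, the sketch below resolves a DNS round-robin alias and tries a plain TCP connection to every address behind it. It is an illustration under assumptions: port 1094 is the XRootD default and may differ per deployment, and a successful TCP connect is a much weaker test than whatever the dashboard probes actually run. The `resolver` and `connect` parameters exist only so the logic can be exercised without network access.

```python
import socket

XROOTD_PORT = 1094  # XRootD default port; assumed here, not taken from this guide


def resolve_alias(alias, port=XROOTD_PORT, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses behind a DNS alias."""
    infos = resolver(alias, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(address, port=XROOTD_PORT, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to (address, port) succeeds within timeout."""
    try:
        conn = connect((address, port), timeout)
    except OSError:
        return False
    conn.close()
    return True


def alias_status(alias, port=XROOTD_PORT, resolver=socket.getaddrinfo,
                 connect=socket.create_connection):
    """Map every address behind the alias to the result of a TCP probe."""
    return {addr: probe(addr, port, connect=connect)
            for addr in resolve_alias(alias, port, resolver)}
```

Calling `alias_status("cmsxrootd.fnal.gov")` from a machine with outbound access returns one entry per redirector host behind the alias; any `False` entry is a host worth checking on the dashboard.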
Detailed Monitoring
A new Xrootd collector will be deployed to replace the TCP GLED xrootd collector described below.
- Xrootd servers report user logins, file opens and read ops via binary UDP -- the total for CMS is between 200 and 600 kB/s.
- UDP-sucker-TCP-server: listens for UDP packets, tags them with receive time + source server, writes them into a ROOT tree (to be able to play them back and debug the stream / collector; up to 10 GB / day), and serves them to connected collectors over TCP. Typically, one can run the production collector (next service) plus a testing / live-monitoring one on a desktop (like Matevz does).
- TCP GLED collector: aggregates the messages and builds an in-memory representation of everything that is going on in a federation. Reports the following:
- serves HTML page of currently opened files (has to be cert protected via apache);
- file-access-report on file close to:
- AMQ broker@CERN
- Gratia collector at UNL (more later)
- write report into ROOT trees for detailed analysis (~couple 100 MBs / month).
- UDP collector of "very detailed I/O information" (fully unpacked vector reads, for detailed analysis of read offsets and access patterns). All UCSD servers (~20) and one server from UNL are also reporting in this format. For this, I only write out ROOT trees (about GB per month).
- due to policy violation concerns raised by IGTF, we run two collectors:
- different config implementation depending on the storage, e.g.: dCache and DPM sites
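The tagging step of the UDP-sucker-TCP-server described above can be sketched as follows. This is an assumption-laden illustration, not the Gled wire format: the 14-byte header layout (`HEADER`), the function names, and the restriction to IPv4 sources are all hypothetical, and writing the stream into a ROOT tree is left out.

```python
import socket
import struct
import time

# Hypothetical frame header, network byte order:
# receive time (float64), IPv4 source address (uint32), payload length (uint16).
HEADER = struct.Struct("!dIH")


def tag_packet(payload: bytes, source_ip: str, received_at: float) -> bytes:
    """Prefix a UDP payload with its receive time and source server."""
    src = struct.unpack("!I", socket.inet_aton(source_ip))[0]
    return HEADER.pack(received_at, src, len(payload)) + payload


def untag_packet(frame: bytes):
    """Inverse of tag_packet: return (received_at, source_ip, payload)."""
    received_at, src, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    return received_at, socket.inet_ntoa(struct.pack("!I", src)), payload


def serve_once(udp_sock, sinks, clock=time.time):
    """Receive one datagram, tag it, and forward the frame to every connected sink."""
    payload, (source_ip, _port) = udp_sock.recvfrom(65535)
    frame = tag_packet(payload, source_ip, clock())
    for sink in sinks:
        sink.sendall(frame)
    return frame
```

Because every frame carries its own receive time and source, a recorded stream can later be replayed into a test collector in the original order, which is the debugging use case mentioned above.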
Instructions for Operators and Shifters
- Redirector Service Availability Test and Monitoring :
- AAA Subscriptions Test and Monitoring :
Managing xrootd machines
For machine access, puppet configs, and crontab-related info, see CompOpsCentralServicesXrootd
Scale tests
Operations and troubleshooting
The troubleshooting guide can be found at https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsAAATroubleshootingGuide
AAA support groups
- Hypernews: hn-cms-wanaccess@cernNOSPAMPLEASE.ch
- interface between site admins and CMS xrootd experts to tackle unexpected behaviors in their local infrastructure of redirector(s)
- xrootd client/servers and storage setup and other WAN related issues
- GGUS: https://ggus.eu/?mode=ticket_cms
- Type of issue: "CMS_Datatransfers"
- Support Unit "CMS AAA - WAN Access"
- usually targeted at site problems affecting federated access, and at helping review a site's configuration against AAA 'standards' (this also applies to sites that run a standalone xrootd cluster without joining the federation), including the storage.xml and TFC configuration
- in all cases, first line of defense for operation issues is cms-comp-ops-transfer-team@cernNOSPAMPLEASE.ch
- Redirectors Admin contacts:
- E-group: cms-service-xrootd@cernNOSPAMPLEASE.ch
- contact to report critical incidents; -GLOBAL, -EU, -US and CRC will receive the message.
- also for announcing scheduled maintenance of any of the redirectors which may affect AAA file access in general (always CC hn-wan-access@cernNOSPAMPLEASE.ch)
- Mattermost: XRootD for CMS Sites
- Slack: aaa
How to examine badly behaving sites in AAA
- The following link shows the production federation metric: top left panel of AAA Subscriptions
- The sites labeled "red" are failing the metric conditions and are therefore candidates for the transitional federation (see Transitional Federation). However, if a site has upgraded its xrootd, there might be two entries, one "red" and one "green". In this case, the extraneous "red" entry can be ignored.
- The metric conditions for being a bad site are the following:
- AAA-related ticket in GGUS open for longer than two weeks.
- SAM xrootd access test < 50% for two weeks.
- Hammer Cloud (HC) xrootd test success rate < 70% for two weeks
- If the site is failing or is a T3:
- Open a ticket (cc CMS WAN Access and TT) to the site asking them to change their redirector
- Use xrdmapc to figure out whether the site has correctly updated its subscription
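The metric conditions listed above can be written down as a small check. This is a sketch under assumptions: the `SiteMetrics` fields and site names are hypothetical, the two-week success rates are assumed to be supplied already aggregated, and failing any single condition is assumed to be enough to mark a site as bad.

```python
from dataclasses import dataclass


@dataclass
class SiteMetrics:
    """Hypothetical per-site inputs, aggregated over the last two weeks."""
    name: str
    tier: int                    # 1, 2 or 3
    oldest_aaa_ticket_days: int  # age of the oldest open AAA-related GGUS ticket, 0 if none
    sam_xrootd_rate: float       # SAM xrootd access success fraction (0.0 to 1.0)
    hc_xrootd_rate: float        # HammerCloud xrootd success fraction (0.0 to 1.0)


def failed_conditions(site: SiteMetrics) -> list:
    """Return the list of metric conditions the site is failing."""
    reasons = []
    if site.oldest_aaa_ticket_days > 14:
        reasons.append("AAA-related GGUS ticket open for longer than two weeks")
    if site.sam_xrootd_rate < 0.50:
        reasons.append("SAM xrootd access test < 50%")
    if site.hc_xrootd_rate < 0.70:
        reasons.append("HammerCloud xrootd test success rate < 70%")
    return reasons


def needs_redirector_change(site: SiteMetrics) -> bool:
    """A failing site or a T3 is asked to change its redirector."""
    return site.tier == 3 or bool(failed_conditions(site))


good = SiteMetrics("T2_XX_Example", 2, 0, 0.98, 0.91)
bad = SiteMetrics("T2_YY_Example", 2, 20, 0.45, 0.80)
for site in (good, bad):
    print(site.name, needs_redirector_change(site), failed_conditions(site))
```

Returning the list of reasons rather than a bare flag makes it easy to paste the failed conditions into the ticket sent to the site.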
How to learn storage backend and version
- From the AAA point of view, it is important to know each site's storage technology (dCache, DPM, etc.) and version. If a site is using a very old version of a storage system, we need to open a GGUS ticket and initiate a system update.
- In the past, we went to the AAAOps github page, looked at README.md and, to check the storage backend, ran the script on lxplus. This is no longer available.
- At the moment, the best option for checking the storage backend is the information hard-coded in the storage.json file. See SITECONF/storage.json for an example.