CMS XRootD Architecture and AAA
This is the homepage for the XRootD-based data federation in CMS, also called AAA (Any data, Anytime, Anywhere).
Installation/Upgrade Best Effort Advisory
- In view of Run 3 and the transition to token-based authentication, upgrading to an XRootD version > 5 is recommended.
- OSG sites are encouraged to install the XRootD Shoveler for detailed monitoring (file reads/writes through xrootd/https/davs, other than FTS). See the appropriate link below. More information will be posted for the EU/Asia sites.
- 5.3.4 (as of Jan 19) is a good candidate. However, this version has a throttle issue (see this xrootd issue). Since most sites probably do not use the throttle, this should not be a big problem.
- 5.4.0 has the fix for the throttle issue in earlier versions, but it has a high-load issue on the redirector (see this xrootd issue). The fix appears to be on its way into the next version.
- 5.4.1 is released with improvements for the TLS issue and the redirector issue.
- 5.4.2 is released:
  - XRootD has a bug impacting CEPH; a fix in v5.5 was backported to v5.4.2. For sites with CEPH, this version is recommended.
  - OSG sites: use the OSG 3.6 repo for 5.4.2-1.1 (includes the throttling patches).
  - non-OSG sites: use the stable xrootd 5.4.2 + xrootd-cmstfc from the OSG osg-upcoming-development repo.
- 5.4.3 has a bug impacting DPM sites, XRootD#1739 (DPM sites please use v5.4.2).
- 5.5.1 has a bug impacting CentOS 7: it requires a restart every 6 hours, or use RHEL 8 (or an RHEL 8-like OS).
- 5.5.2 fixes the bug mentioned above in the 5.5.1 entry, but this release has a checksum issue.
- 5.5.3 is the bug-fix release for the checksum issue in 5.5.2.
- 5.5.4 has the bug fix for the redirector disconnection issue with TLS.
Documentation
For Users
The following user documentation is also available:
- XRootD Client Usage - How to utilize the current infrastructure on your desktop, in ROOT, or in a CRAB job.
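As a rough illustration of client usage, a CMS logical file name (LFN) can be turned into a globally readable XRootD URL by prefixing the global redirector. This is a sketch, not taken from this page: the redirector hostname and sample LFN below are assumptions for illustration.

```python
# Sketch: build a global XRootD URL from a CMS logical file name (LFN).
# The redirector hostname is an assumption for illustration.
GLOBAL_REDIRECTOR = "cms-xrd-global.cern.ch"

def xrootd_url(lfn: str, redirector: str = GLOBAL_REDIRECTOR) -> str:
    """Prefix an LFN (which starts with /store/...) with the redirector."""
    if not lfn.startswith("/store/"):
        raise ValueError("CMS LFNs are expected to start with /store/")
    # Note the double slash: root://<host>/<lfn>, where the LFN itself
    # begins with '/', giving root://host//store/...
    return f"root://{redirector}/{lfn}"

# Such a URL can then be opened in bare ROOT with TFile::Open(...) or
# copied with xrdcp; the federation handles the site lookup transparently.
print(xrootd_url("/store/data/Run2012B/SingleMu/AOD/example.root"))
```

The key point is that the same URL works from anywhere: the redirector, not the user, decides which site actually serves the file.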
For Admins
The following documentation is aimed at the sysadmins of CMS sites:
For Operators
Introduction
CMS is exploring a new architecture for data access, emphasizing the following four items:
- Reliability: The end-user should never see an I/O error or failure propagated up to their application unless no USCMS site can serve the file. Failures should be caught as early as possible and I/O retried or rerouted to a different site (possibly degrading the service slightly).
- Transparency: All actions of the underlying system should be automatic for the user – catalog lookups, redirections, reconnections. There should not be a different workflow for accessing the data "close by" versus halfway around the world. This implies the system serves user requests almost instantly; opening files should be a "lightweight" operation.
- Usability: All CMS application frameworks (CMSSW, FWLite, bare ROOT) must natively integrate with any proposed solution. The proposed solution must not degrade the event processing rate significantly.
- Global: A CMS user should be able to get at any CMS file through the Xrootd service.
To achieve these goals, we will be pursuing a distributed architecture based upon the XRootD protocol and software developed by SLAC. The proposed architecture is also similar to the current data management architecture of the ALICE experiment. Note that we specifically did not list scalability here - we already have an existing infrastructure that scales just fine. We have no intention of replacing the current CMS data access methods for production.
We believe that meeting these goals will greatly reduce the difficulty of data access for physicists working at the small or medium scale. This new architecture has four deliverables for CMS:
- A production-quality, global XRootD infrastructure.
- Fallback data access for jobs running at the T2.
- Interactive access for CMS physicists.
- A disk-free data access system for T3 sites.
Architecture
To explore the XRootD architecture, we put together a prototype for the WLCG, involving CMS sites worldwide and all the relevant storage technologies. This prototype wrapped up in January 2011, and we are moving to a regional redirector-based system. This adds another layer to the hierarchy, ensuring that requests stay within a local network region when possible.
Local-region redirection
The image below shows the communication paths for a user application querying the regional redirector when the desired file is within the region. First (1), the user application attempts to open the file in the regional redirector. If the regional redirector does not know the file's location, it will then query all of the logged-in sites (2). In this diagram, Site A responds that it has the file, so the redirector redirects (3) the client to Site A's xrootd server. Finally, the client contacts Site A (4) and starts reading data (5). This is all implemented within the Xrootd client; no user interaction is necessary.
Cross-region redirection
The image below shows the communication paths for a user application querying the regional redirector when the desired file is not within the region. This proceeds as in the previous case, except that all local sites respond that they do not have the file. The regional redirector then contacts the other regions (3); if the file's location is not in cache, the other regional redirector queries its sites (4). In this example, the user is redirected to Site C (5) and successfully opens the file (6 and 7).
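The two redirection flows above can be sketched as a small simulation. The site and region names below are invented for illustration; the real logic lives inside the xrootd/cmsd daemons, not in user code.

```python
# Toy model of local-region and cross-region redirection.
# Site names, region names, and file names are invented for illustration.

class Site:
    def __init__(self, name, files):
        self.name = name
        self.files = set(files)

    def has_file(self, lfn):
        return lfn in self.files

class RegionalRedirector:
    def __init__(self, name, sites):
        self.name = name
        self.sites = sites        # sites subscribed to this region
        self.peers = []           # other regional redirectors

    def locate(self, lfn, ask_peers=True):
        # (2) Query all subscribed sites in this region first.
        for site in self.sites:
            if site.has_file(lfn):
                return site.name  # (3) redirect the client to this site
        # (3)-(4) No local site has it: ask the other regions.
        if ask_peers:
            for peer in self.peers:
                found = peer.locate(lfn, ask_peers=False)
                if found:
                    return found  # (5) redirect the client across regions
        return None               # no federation site can serve the file

# Two regions, each with one site holding one file.
eu = RegionalRedirector("EU", [Site("SiteA", {"/store/a.root"})])
us = RegionalRedirector("US", [Site("SiteC", {"/store/c.root"})])
eu.peers = [us]
us.peers = [eu]

print(eu.locate("/store/a.root"))  # served within the region: SiteA
print(eu.locate("/store/c.root"))  # cross-region redirect: SiteC
```

The `ask_peers=False` on the peer query mirrors the idea that a cross-region lookup should not bounce back and forth between redirectors indefinitely.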
Fallback Access
In the prototype, most sites will not use XRootD as their primary access method; instead, they will use it as a fallback. The image below shows how file access would work for such a site:
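The fallback behavior can be sketched as follows. The `local_storage` and `federation` callables here are invented stand-ins; in practice this logic is driven by the site's configuration inside CMSSW, not written by users.

```python
# Toy sketch of fallback access: try the site's local storage first,
# and only fall back to the XRootD federation when the local open fails.
# Both storage callables are invented stand-ins for illustration.

def open_with_fallback(lfn, local_storage, federation):
    """Return (source, result) for the first storage that works."""
    try:
        return ("local", local_storage(lfn))
    except IOError:
        # Local open failed (file missing, storage outage, ...):
        # retry transparently through the federation redirector.
        return ("federation", federation(lfn))

def local(lfn):
    raise IOError(f"{lfn} not on local storage")

def federation(lfn):
    # Hypothetical redirector hostname, for illustration only.
    return f"root://redirector.example/{lfn}"

source, url = open_with_fallback("/store/x.root", local, federation)
print(source)  # federation
```

This matches the reliability goal above: the failure is caught and the I/O rerouted to the federation, possibly with somewhat degraded performance, rather than being propagated to the user's application.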
Other notes related to AAA
Global and Regional Redirectors
The service availability monitoring for the global and regional redirectors
Participating Sites
The list of all sites participating in (subscribed to) AAA is monitored here.
Tests and Issues
Scale tests
Historical records
- Tests we have performed for the Xrootd Demonstrator (dating back to the 2010 initiative) are documented on this page.
- We are also trying to document all the issues we observe with the xrootd-based system here: CmsXrootdIssues.
- We record the CMSSW/ROOT I/O improvements needed here: CmsRootIoIssues.
Presentations and Workshops
- Presentations:
- XRootD Workshop in UCSD 2015:
- OSG AHM 2014: Storage Federations (see Friday's agenda)