Xrootd Production and Integration

The xrootd data access system is oriented toward the end user, so our priorities are a satisfactory end-user experience and a stable system. To that end, we partition sites into production and integration infrastructures. The integration infrastructure lets us send controlled tests to a participating site to help it stabilize operations without exposing it to chaotic user load. The production infrastructure should be stable, allow all CMS users, and export any file registered to the site in PhEDEx.

Once a site feels it is ready (for example, if it passes the production checklist below) and passes our criteria, it will be allowed into the production infrastructure.

Xrootd Production Checklist

System administrators are expected to go through the following checklist to verify their site's xrootd installation.

  • Verify your xrootd server is not blocked by a firewall. A user should be able to access it from the public internet.
  • Verify the xrootd server requires GSI-authentication for offsite clients. Any CMS user with a grid certificate should be able to read official CMS files.
  • Try opening and downloading a few files using ROOT and the xrdcp client, respectively.
  • Try writing a file via xrootd and make sure it fails. We suggest that xrootd remain read-only (although sites can certainly ignore this suggestion).
  • Verify that any file registered to your site in PhEDEx is available through Xrootd.
  • Verify Xrootd exports the CMS namespace, not the site namespace. That is, file names should start with /store when read through Xrootd.
  • Set up a mapping for the test namespace. Have /store/test/xrootd/$SITENAME/store/(.*) map to /store/$1. This allows us to query the redirector for a specific file and get a response only from hosts at $SITENAME, which is an important characteristic for our testing.
  • Join your site to a regional redirector within 50ms RTT of your xrootd server.
  • Read the documentation on throttling Xrootd. Consider the impact of a malicious (or uneducated) user. Do you have controls you are comfortable with for each of the following dimensions: site bandwidth utilized, IOPS, and namespace queries?
  • Read the documentation on monitoring. You will want to keep an eye on the monitoring it describes, and may want to integrate it into your own site monitoring.
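
The test-namespace mapping in the checklist can be illustrated as a regular-expression rewrite. The sketch below is plain Python, not xrootd configuration syntax, and the function name `map_test_namespace` is hypothetical; it only shows what the rule /store/test/xrootd/$SITENAME/store/(.*) to /store/$1 is expected to do to a path.

```python
import re

def map_test_namespace(path, sitename):
    """Rewrite /store/test/xrootd/<sitename>/store/<rest> to /store/<rest>.

    Paths outside this site's test namespace are returned unchanged
    (an assumption of this sketch, not a statement about xrootd itself).
    """
    pattern = r"^/store/test/xrootd/" + re.escape(sitename) + r"/store/(.*)$"
    match = re.match(pattern, path)
    if match is None:
        return path
    return "/store/" + match.group(1)

# Example with a made-up file name under the site's test namespace.
print(map_test_namespace(
    "/store/test/xrootd/T2_US_Nebraska/store/data/file.root",
    "T2_US_Nebraska",
))
```

Because the site name is part of the matched prefix, a test path naming a different site is not rewritten to a /store path here, which mirrors the goal of getting a response only from hosts at $SITENAME.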

Production Criteria

Your site must pass the following criteria in order to be listed as production. These will be re-evaluated monthly at all production sites in order to maintain a minimum quality of service in production. Eventually, these items will be integrated into the normal site status board.

  1. At least three xrootd hosts at T2 sites and two hosts for T3 sites are required.
    • For T2s, the expected load should require two servers; for T3 sites, one server. The additional host is for redundancy.
    • This requirement is so an end-user can expect reasonable performance when accessing official CMS data.
    • Each server ought to be equivalent to a 4-core machine with at least 8 GB of RAM.
    • Fewer servers can be used for sites serving only a unique namespace, e.g., a T3 with a private namespace that doesn't conflict with the official CMS namespace.
  2. 95% availability in the redirector as measured by heartbeat tests.
    • The heartbeat tests frequently (approximately every 10 minutes) download a few bytes from a single, known file.
    • This will be done directly against each known server (not through the redirector), and, in a separate test, via the integration redirector.
  3. 95% availability in the random file tests.
    • The random file test will attempt to download a random file registered at the site in PhEDEx approximately once an hour.
    • This will be done via the regional redirector.
  4. 95% success rate in xrootd JobRobot.
    • A CRAB task running the JobRobot test will be run approximately daily at one site in the region (T2_US_Nebraska for the US), reading files from the remote site via the redirector.
    • Success rate is the percentage of successful jobs run that day.

If one of these criteria is not currently measured (for example, we estimate the JobRobot test won't be available until May 1), then the site is excused from that criterion.
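
To make the criteria concrete, here is a minimal Python sketch of how a site might be scored against them. It is not the actual evaluation tooling: the function names, the criterion keys, and the representation of test results as lists of booleans are all assumptions made for illustration. Only the 95% threshold, the host minimums, and the rule that unmeasured criteria are excused come from the text above. The exception for sites serving only a unique namespace and the monthly evaluation window are not modeled.

```python
THRESHOLD = 0.95  # 95% level stated in criteria 2-4

def success_fraction(results):
    """Fraction of successful attempts, or None if nothing was measured."""
    if not results:
        return None
    return sum(1 for ok in results if ok) / len(results)

def evaluate_site(tier, host_count, measurements):
    """Return {criterion: "pass" | "fail" | "excused"} for one site.

    measurements maps a hypothetical criterion key to a list of booleans
    (one per test attempt); a missing or empty entry means the criterion
    is not currently measured, so the site is excused from it.
    """
    min_hosts = {"T2": 3, "T3": 2}[tier]  # criterion 1
    report = {"hosts": "pass" if host_count >= min_hosts else "fail"}
    for name in ("heartbeat", "random_file", "jobrobot"):
        fraction = success_fraction(measurements.get(name))
        if fraction is None:
            report[name] = "excused"
        elif fraction >= THRESHOLD:
            report[name] = "pass"
        else:
            report[name] = "fail"
    return report

# Made-up data: 96/100 heartbeats succeed, 9/10 random-file reads succeed,
# and the JobRobot test is not yet available.
example = evaluate_site(
    "T2",
    3,
    {
        "heartbeat": [True] * 96 + [False] * 4,
        "random_file": [True] * 9 + [False],
    },
)
print(example)
```

With the made-up data above, the host and heartbeat criteria pass, the random file criterion fails at 90%, and JobRobot is reported as excused.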

All production sites must also be in the integration infrastructure in order for the monitoring to function.

Topic revision: r3 - 2011-03-09 - BrianBockelman