AAA Operations Guide

Major goals

  • The project focuses on the development and deployment of storage resources that make any data accessible anywhere, at any time.
  • For CMS specifically, the goal is to enable easier, faster and more efficient processing and analysis of the data recorded at the LHC through more flexible use of computing resources.

Terminology

  • XRootD protocol: the core software technology for federating storage
  • Federation: the set of participating WLCG sites that join the AAA concept
  • Topology: the hierarchical structure of federated data access across the participating sites
  • Redirectors: topology checkpoints that subscribe upstream in the hierarchy of redirectors and redirect access to the local data storage

Federated Storage Infrastructure

  • An XRootD infrastructure that spans all of the Tier-1 and Tier-2 sites in EU and US CMS;
  • Each site's xrootd server is interfaced with the local storage system, allowing it to export the CMS namespace;
  • Site servers subscribe to a local redirector (recommended)
    • the local redirectors from each site then subscribe to a redundant regional redirector:
      • US: hosted by FNAL (DNS round-robin alias cmsxrootd.fnal.gov)
      • EU: Bari, Pisa, Paris (DNS round-robin alias xrootd-cms.infn.it)
      • Transitional: hosted by CERN (DNS round-robin alias cms-xrd-transit.cern.ch)
  • This forms a large tree-structured federated storage system;
    • a user can request a file from the regional redirector, which will then query the child nodes in the tree and redirect the user to a server that can serve the file (a minimal code sketch follows below);
    • the entire AAA infrastructure overlays the existing storage systems, allowing users to access any on-disk data without knowing its location. A simplified version of this process is shown in the attached figure (Regional_Xrootd.png).
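
As an illustration of the redirection flow, here is a minimal sketch using the XRootD Python bindings (pyxrootd): it asks the US regional redirector where a file is located and then opens it through the redirector. This is not part of the AAA tooling; it assumes the bindings are installed and a valid grid proxy is available, and it reuses the SAM test file path quoted in the Kibana section further down.

    # Minimal sketch (not AAA tooling): ask a regional redirector for a file and
    # let it redirect us to a site server. Assumes the XRootD Python bindings
    # (pyxrootd) and a valid grid proxy; the file is the SAM test file used
    # elsewhere in this guide.
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    REDIRECTOR = "root://cmsxrootd.fnal.gov"   # US regional redirector
    LFN = ("/store/mc/SAM/GenericTTbar/AODSIM/"
           "CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/"
           "A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root")

    # Ask the redirector which site servers can serve the file.
    fs = client.FileSystem(REDIRECTOR)
    status, locations = fs.locate(LFN, OpenFlags.REFRESH)
    if status.ok:
        for loc in locations:
            print("served by", loc.address)

    # Or simply open the file through the redirector; the redirection is transparent.
    f = client.File()
    status, _ = f.open(REDIRECTOR + "/" + LFN)
    if status.ok:
        status, data = f.read(0, 1024)   # read the first kB as a sanity check
        print("read", len(data), "bytes")
        f.close()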

Monitoring

See details at XrootdMonitoring

Monitoring of Xrootd servers and data transfers

What is currently being monitored:

  1. Xrootd summary monitoring - tracking server status
    • MonALISA-based monitoring webpages for Xrootd are located here: http://xrootd.t2.ucsd.edu/
    • More details and time series can be accessed via the MonALISA (ML) Java GUI; select the group xrootd_cms after the window comes up.
  2. Xrootd detailed monitoring - tracking user sessions and file transfers
    Detailed file-access monitoring is implemented as a custom C++ application, !XrdMon. It also provides an HTML view of the currently open files (see the instructions for URL parameters). When a file is closed, a report is sent to the CERN Dashboard and, for US sites, to OSG Gratia for further processing.
  3. Dashboard-based monitoring
    The Dashboard receives the file-access report records from the XrdMon collectors and integrates them as part of its WLCG-wide transfer monitoring. The development version is available here: http://dashb-wdt-xrootd.cern.ch/ui/#vo=(cms)#

Monitoring Components (System requirements)

Summary Monitoring
  • xrd-rep-snatcher -- a Perl script that receives the summary UDP packets (quasi XML), normalizes the names, times, etc., calculates rates and forwards the data to the ML service (a minimal sketch of the receiving step follows this list). It currently uses about 10% CPU, with close to 2000 machines reporting roughly every 30 s.
  • ML service -- receives data from xrd-rep-snatcher and other sources, keeps a memory cache of roughly the last 4 hours, and allows one to make detailed real-time plots of what is going on. It is a Java service, set to use 4 GB of RAM.
  • ML repository -- stores the data from the service into Postgres, providing long-term storage and a web interface for plots. It uses 4 GB of RAM for Java plus 8 GB for Postgres to keep it reasonably fast, and it needs fast disks: we have 4 SAS disks in a RAID-5 configuration, currently using 300+ GB for two years' worth of data. SSDs have not worked: ours died after 4 months, and ALICE reported the same, burning through three before switching to a SAS RAID-5.
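
xrd-rep-snatcher itself is a Perl script; the following is only a minimal Python sketch of its first step, receiving the XRootD summary-monitoring UDP reports (small XML documents) and unpacking them. The port is an assumption and must match whatever the servers' xrd.report directive points to; name normalization, rate calculation and forwarding to MonALISA are omitted.

    # Minimal sketch, NOT the production xrd-rep-snatcher (which is Perl).
    # Receives XRootD summary-monitoring UDP reports (XML) and dumps their contents.
    import socket
    import xml.etree.ElementTree as ET

    PORT = 9930   # assumed; must match the servers' xrd.report target
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))

    while True:
        data, (src_host, src_port) = sock.recvfrom(65536)
        try:
            report = ET.fromstring(data)
        except ET.ParseError:
            continue   # skip malformed ("quasi XML") packets
        # A real collector would normalize names, compute rates and forward to ML;
        # here we just print the report header and the sections it carries.
        print(src_host, report.tag, report.attrib)
        for section in report:
            print("  ", section.tag, section.attrib)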

Detailed Monitoring
  • Xrootd servers report user logins, file opens and read operations via binary UDP -- the total for CMS is between 200 and 600 kB/s.
  • UDP-sucker-TCP-server: listens for the UDP packets, tags them with the receive time and source server, writes them into a ROOT tree (so that the stream/collector can be played back and debugged; up to 10 GB/day) and serves them to the connected collectors over TCP (a minimal sketch of this relay follows the list). Typically, one runs the production collector (the next service) plus a testing/live-monitoring one on a desktop (as Matevz does).
  • TCP GLED collector: aggregates the messages and builds an in-memory representation of everything that is going on in the federation. It reports the following:
    • serves an HTML page of the currently open files (has to be certificate-protected via Apache);
    • sends a file-access report on file close to:
      • the AMQ broker at CERN
      • the Gratia collector at UNL (more later)
      • ROOT trees, written out for detailed analysis (a couple of hundred MB per month).
  • UDP collector of "very detailed I/O information" (fully unpacked vector reads, for detailed analysis of read offsets and access patterns). All UCSD servers (~20) and one server from UNL also report in this format. For this, only ROOT trees are written out (about a GB per month).
  • Because of policy-violation concerns raised by IGTF, we run two collectors.
  • The configuration implementation differs depending on the storage system, e.g. for dCache and DPM sites.
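
The following is a stripped-down Python sketch of the relay idea behind the UDP-sucker-TCP-server: receive a detailed-monitoring UDP packet, tag it with the receive time and source server, and hand it to a connected collector over TCP. Ports and the framing format are assumptions, and the ROOT-tree dump for replay/debugging is omitted.

    # Minimal sketch, NOT the production UDP-sucker-TCP-server: receive binary
    # monitoring UDP packets, tag each with receive time + source address, and
    # forward to one connected TCP collector. Ports and framing are assumptions.
    import socket, struct, time

    UDP_PORT, TCP_PORT = 9330, 9331          # assumed ports

    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    udp.bind(("", UDP_PORT))

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", TCP_PORT))
    srv.listen(1)
    collector, _ = srv.accept()              # wait for one collector to connect

    while True:
        pkt, (host, port) = udp.recvfrom(65536)
        src = host.encode()
        # Assumed framing: receive time (double), source length + source, payload length + payload.
        header = struct.pack("!dH", time.time(), len(src)) + src + struct.pack("!I", len(pkt))
        collector.sendall(header + pkt)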

Summary and Detailed Monitoring Links

Filtered queries

  • The filter string is passed as a "like" query to Postgres, so % is a wildcard (see the sketch below).
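
For illustration only, this is roughly how such a filter string ends up in a LIKE clause; the connection string, table and column names below are invented, only the %-as-wildcard semantics is the point.

    # Hypothetical sketch of how a filter string is applied on the Postgres side;
    # table/column names and the DSN are made up for illustration.
    import psycopg2   # assumed driver

    pattern = "%.t2.ucsd.edu"   # matches every server name ending in .t2.ucsd.edu
    conn = psycopg2.connect("dbname=ml_repository")
    cur = conn.cursor()
    cur.execute("SELECT name FROM xrootd_servers WHERE name LIKE %s", (pattern,))
    for (name,) in cur.fetchall():
        print(name)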

Aggregated Reports (Plots)

Detailed Monitoring Links

Dashboard

The CMS XRootD dashboard can be found at http://dashb-cms-xrootd-transfers.cern.ch/

Kibana (a.k.a. shifter instructions)

  • Testing of the 4 redirector DNS alias instances and the 9 redirector host instances:
    • GLOBAL: cms-xrd-global.cern.ch
    • EU: xrootd-cms.infn.it
    • US: cmsxrootd.fnal.gov
    • Transitional Federation: cms-xrd-transit.cern.ch
  • Two hosts behind a DNS alias (a simple HA service) produce the XMLs that are pulled by Kibana:
    • xrdfedmonitor-cms.cern.ch (vocms037 and vocms038)
    • script: /var/www/XRDFED-kibana-probe.py (a minimal sketch of such a probe follows this list)
      • it executes two commands as root:
        • xrd <REGIONAL_REDIRECTOR | DNS_ALIASED_REDIRECTOR> query 1 a
        • xrdcp -d 2 -f -DIReadCacheSize 0 -DIRedirCntTimeout 180 root://<REGIONAL_REDIRECTOR | DNS_ALIASED_REDIRECTOR>/<SAM_test_file> /dev/null
          • SAM test file: /store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root
      • implementation: CMS service certificate; service proxy (needed depending on the site's policy and CMS user mapping)
  • The XML-based information is visualized via Kibana:
    • https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::XrootD
    • If you want to add notifications:
      • You need to subscribe to the ai-admins e-group.
      • Once your subscription to the ai-admins group is approved, add the notifications in the metric manager (you have to be in CERN's network, or use a tunnel, to access the metric manager).
      • In the metric manager, first click on "login" on the right,
      • then click on "manage" and "add notifications",
      • and fill in the fields based on this.
  • Shifter instructions: https://twiki.cern.ch/twiki/bin/viewauth/CMS/CMSCriticalServiceXrootd
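
Below is a minimal Python sketch of what such a probe does, assuming the two commands quoted above: run the checks against each redirector alias, record pass/fail, and emit a small XML result. The actual /var/www/XRDFED-kibana-probe.py and the XML schema it produces differ; the element and attribute names here are invented.

    # Minimal probe sketch; the real /var/www/XRDFED-kibana-probe.py and its XML
    # schema differ. Runs the two checks quoted above per redirector and prints XML.
    import subprocess, time
    import xml.etree.ElementTree as ET

    SAM_FILE = ("/store/mc/SAM/GenericTTbar/AODSIM/"
                "CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/"
                "A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root")

    REDIRECTORS = ["cms-xrd-global.cern.ch", "xrootd-cms.infn.it",
                   "cmsxrootd.fnal.gov", "cms-xrd-transit.cern.ch"]

    def run_ok(cmd):
        """Return True if the command exits with status 0 (within a timeout)."""
        try:
            return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                                   stderr=subprocess.DEVNULL, timeout=300) == 0
        except (subprocess.TimeoutExpired, OSError):
            return False

    root = ET.Element("xrdfed_probe", timestamp=str(int(time.time())))  # invented layout
    for red in REDIRECTORS:
        query_ok = run_ok(["xrd", red, "query", "1", "a"])
        copy_ok = run_ok(["xrdcp", "-d", "2", "-f",
                          "-DIReadCacheSize", "0", "-DIRedirCntTimeout", "180",
                          "root://" + red + "/" + SAM_FILE, "/dev/null"])
        ET.SubElement(root, "redirector", name=red,
                      query="ok" if query_ok else "fail",
                      copy="ok" if copy_ok else "fail")
    print(ET.tostring(root, encoding="unicode"))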

Scale tests

Operations and troubleshooting

The troubleshooting guide can be found at https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsAAATroubleshootingGuide

New environment setup

AAA support groups

Managing xrootd machines

For machine access, Puppet configs and crontab-related info, see CompOpsCentralServicesXrootd

How to examine badly behaving sites in AAA

How to learn storage backend and version

  • From the AAA point of view it is important to know each site's storage technology (dCache, DPM, etc.) and its version. If a site is using a very old version of a storage system, we need to open a GGUS ticket and initiate a system update.
  • Please go to the AAAOps GitHub page and have a look at README.md.
  • To run the script, use lxplus (not vocms037).

Topic attachments

  • Regional_Xrootd.png (51.7 K, 2015-02-24, MericTaze)