XRootD Federation Monitoring

This page documents the XRootd Federation Monitoring project.


  • Create a Dashboard to monitor XRootd at the federation level.


XROOTD monitoring

Implementation of the XRooTD monitoring foresees 3 levels of hierarchy: site, federation, global.

Monitoring on the site level is implemented in the framework of the Tier3 monitoring project and is currently under validation by the ATLAS pilot sites. More information can be found in the attached talk Installation and documetation instructions can be found here https://svnweb.cern.ch/trac/t3mon/wiki/xRootdAndGangliaDetailed. xrootd has a buildin monitoirng functionality which enables reporting of the monitoring data via UDPs.

There are curently two data flows enabled, smry and detailed ones. Smry data flow does not contain per-file information and therefore does not provide enough info about transfered files, users, etc... THe smry flow is currently used by the Monalisa xrootd monitoring repository. In order to have a complete view, CMS (UCSD) had developed collector which allows to read detailed flow and to create per-file reports. Initially these reports was sent via UDP. Recently, new version of the UCSD collector was created in order to report to ActiveMQ. UCSD collector is curently deployed one per federation (US ATLAS and US CMS) , but deployment policy can be flexible depending on the size of the federation and volume of reported data. Reports sent to ActiveMQ are consumed by the WLCG Transfer Dashboard, CMS popularity and new federation monitoring system. The latter should provide complete view of all information required for operating of the xrootd federation and for federation-member site support teams. Requirements for this system are being collected from the ATLAS xrootd community (see presentation of Rob Gardner in the attachment)

The overall data flow is presented in the schema below.

* DataFlowFedMon.jpg:

Data flow for the federation-level monitor

* XrootdFedDataFlow.jpg:

XRootD - prototype file-access report message format

Example of a message currently sent by xrdmon on a file-close event. This is in plain-text, just for readability, JSON is already used for ActiveMQ messages. Some comments are inserted into the message itself using the // syntax.

// 0. Unique record ID constructed from current time; required by Gratia
// 1. Information about file and data-transfer
// These byte-counts are reported by server in the close message when xrootd.monitor contains the 'files' flag.
// Can be rounded up (if number takes more some number of bits).
// These are accumulated by the collector from individual read / write request
// traces that are only sent when xrootd.monitor contains the 'io' flag (requires (or maybe implies) also 'files').
// 2. Information about user / session
server_username=/DC=org/DC=doegrids/OU=People/CN=John R. Hover 47116
// 3. Information about server


  • The message is composed of three "sections":
    1. file and file-access information;
    2. user / session /client information;
    3. information about xrootd server that served the file / session.
  • About byte-counts:
    1. read/write_bytes_at_close are reported by the server and are rounded up if the numbers are too big.
    2. Other read_, read_single_ and read_vector_ fields are summed up internally by the collector whein io traces are enabled. The read_ without single or vector are just summed up values of single and vector entries and are redundant (there because that was the first thing implemented).
    3. For read_vector_" one vector-request is counted as one operation. Further, there are =read_vector_count_ fields that give statistics about number of file-chunks asked for in a single operation.
    4. Configuring GSI authentication in xrootd is somewhat tricky as there is no common rule how security plugins map various elements into XrdSecEntity structure that is then passed down to monitoring. We / USCMS, use grid-mapfiles on some sites and GUMS via xrootd-lcmpas authentication plugin (and get consistent results for those cases). Depending on what ATLAS decides to use, another iteration in XrdSecGsi might be needed.
Additional questions:
  • What is the difference between bytes_read and bytes_read_at_close?

  • What happens if there is something wrong with file reading/transferring. Do we get end of file reading message at all?
It depends how it happens, but Gled will always send at least some info.

a) Client session gets disconnected -- server will send the file-close message and all goes as normal.

b) File/session close UDP message is lost: there are setting for inactivity timeout from a given user ... I think the default is 24 hours. Once this is reached, all the files get reported as closed. Fields that report server-side read/write amount are zero in this case.

c) Severs goes down / gets restarted. Again, we have a timeout for no messages from a server. Additionally, servers can be configured to send periodic "server identification" messages. When these are in use, we can detect that server went down much faster (we/uscms send indent every 5min and consider server to be down when three consecutive messages do not arrive (the time delta between idents is "measured")).

  • In the native xrootd monitorig flow we have no info about failures. However, if we plot timeout transferes/accesses we migt be able to get an idea about efficiency. Is it correct?
Matevz could add a status flag, telling if proper close record was received. To the first approximation

if file_close_reported = (read_bytes_at_close = 0 || write_bytes_at_close = 0) the transfer is OK, otherwise something went wrong. This can be taken to calculate an efficiency.

Discussion about close to real time monitoring

  • What is a suggestion for the format for more real-time view
Matevz suggests to look into http://xrootd.t2.ucsd.edu:4243/?no_same_site

  • Summary of the mail discussion
Close-to-real-time information should be in a form of time-based events (~ once per minute). What is interesting is number of reads/writes, single and vector ones. This is contained in a file close events. Though it can happen that the file is being opened for a long time, in this case single close-file report is not enough. Seeks are useless and should not be included in the aggregated reports. What should be included in the time-based events is fileid and bytes transferred. Matevz suggested to just send "summed up" single & vector reads over the last time bin for files that had been changed. Ilija agreed that it is what is needed , but the problem is that according to Andy this would require significant changes in xrootd an it is not the top priority for the xrootd development. However the discussion is still ongoing.

Requirements and what is realistic to implement based on the current data flow

requirement scope doable or not, based on what we can get from xrootd monitoring data
redirection staitics: fraction of time accesses are local, redirected within region, cloud , or global relevant both for WLCG monitor and federation level monitor OK given we have a proper topology in place
Authentication successes/failures federation-level monitor Currently this info is not available in the data reprted to ActiveMQ, but is available in the ML repository , therefore there is a way to retrive it, might be a part of the smry flow, to be checked
Number of files opened Federation level monitor Is not possible to have it ~ to real time if we have only file close reports. Should be possible if we also get reports when the file is being opened, check with Matevz whether this kind of reports can be added
Distinguish direct access versus copy Relevant either for popularity, or for Federation level monitoring, not relevant for Global Transfer Dashboard Should provide an ability to understand whether file was accessed directly from the remote storage, or instead the copy request had been issues an dthen the file was accessed at the local site. In principle in order to compare direct vs copy, should be enough to compare #read bytes/#of transferred bytes, but # of read bytes should not include local access to files which had been transferred beforehand. The first approximation would be to consider only remote reading
Distinguish local versus WAN relevant both for WLCG monitor and federation level monitor OK, since we can make difference based on server -client domains
Statistics for files actually used and mode of access Federation level monitor OK, is already available in popularity, can be enabled in the federation level monitor as well
User statistics for direct access versus copy Relevant either for popularity, or for Federation level monitoring, not relevant for Global Transfer Dashboard similar to # of bytes, but claculated per user, or in terms of user
For brokerage, cost matrix - To make a decission from where in the federation it will be most efficient to get a file, job broker will need to know cost of each transfer. Currently we define the cost in the simplest way: a number of seconds spent to copy a well defined amount of data from one federated site to another.
ranking plots, sites by data relevant both for WLCG monitor and federation level monitor OK
File lifetime distributions by site - Not clear wheter it belongs here, where this info is supposed to come from, how we know when the file was deleted?
"active" data volume at site, absolute and as a fraction of capacity, where "active" file is one used in the last X weeks/months Federation level monitor or rather popularity application This requirement implies 1).to keep long history for every file access (in order to define active files) , which might not be a good idea for the federation level monitor, 2).knowledge of storage capacity at every site. Where the latter should come from? Not sure whteher it belongs to federation monitoring, might be rather VO-level popularity application which can get aggregated daily data from the federation level monitor through an API. On the other hand can enable transparent navigation from federation level monitor to the popularity plots/tables
plot of file age and deletion (cleanup) , and plot of avg file age at deletion by site Federation level monitor or rather popularity application Same as previous, looks like it belongs to VO-level popularity application. Needs to get data about data cleaning at the site, where is it supposed to come from?
Site availability metrics and ranking plots SSB Though natural place for this information is SSB, federation-level monitor should provide transparent navigation to the SSB plots generating required links from the federation monitor UI

Integration of ALICE xrootd traffic into WLCG transfer dashboard

Xrootd monitoring information from ALICE (namely active transfers information) is going to the Dashboard in 4 steps:

  1. Specially developed filter for MonALISA service extract transfers related messages and puts them to the local message queue
  2. Aggregator program reconstructs the transfer messages from their parts, join them to bigger message and send to another local queue
  3. The aggregated messages are sent to the message broker using stompclt running in daemon mode
  4. On the Dashborad's side comsumer receives the messages and extract transfer information from them

What is being monitored

Monitored objects are transfers between xrootd servers and clients. One server could have several transfers being performed by one client. So all the transfers will appear in the monitoring information separately.

Transfer's statistics includes following information:

  • timestamp (when information was sent to ML)
  • server host
  • client IP (and FQDN, if aggregator ws able to resolve it)
  • read megabytes (for last minute)
  • written megabytes (for last minute)
  • MonALISA site name server belongs to
Please note, that all the transfers' information is provided for the interval of one minute.

Proposed message structure

The monitoring information is encapsulated to the transport message in JSON format having head and body section (to read more about the abstraction used, please see the documentation of stompclt, for the message structure see Message::Message object description).

head section contains message id, timestamp and monitring host name.

body section contains transfers monitoring information itself. It is a JSON encoded to string with json.dumps(). In the consumer it should be restored with json.loads().

Contents of the body section:

   "transfers": [
                "message_id": "550e8400-e29b-41d4-a716-446655440000",    // UUID string
                "timestamp": "123456789",
                "server_host": "host.example.com",
                "site_name_ml": "XRD_SITE",
                "client_ip": "",
                "client_fqdn": "client.phys.net",   // optional, could be missing if hostname was not resolved
                "client_site_alice": "",                  // optional, for future use
                "client_gridsite": "",                     // optional, for future use
                "read_mb": "10.543",
                "written_mb": "7.891",

Messaging topisc used

Now messages are being send to the test broker dashb-mb-test ( stomp://dashb-mb-test.cern.ch:6163) to the topic /topic/xrootd.ALICE.

Federation-level monitoring

Federation-level monitoring is being currently implemented for two persistency backends : ORACLE and Hadoop. ORACLE does not match per-federation deployment model, but is used for quick prototyping of the functionality (understanding which metrics are required, aggregation levels , etc...) and user interface.

Federation-level monitoring should contain :

  • transfer throughput metrics (similar to ones in the WLCG Transfer Dashboard)
  • metrics which are currently available in the MonAlisa repository (so that users would have everything in one single place)
  • user access information (popularity-like)

User access information

A new tab have been added to the dashboard allowing to correlate many different information and rank them depending on a parameter. This information have summarized in 3 different plots and are presented as following.

  • General plot:
    • Total read : Total byte read by a user during a given period.
    • Single read : Byte read by a user using the normal read function.
    • Vector read : Byte read by a user using the vector read function.
    • Fraction read : Average fraction of a file read by the user.
    • Ratio vector read : vector read / (single read + vector read).
  • Access type plot (local VS remote access):
    • Local access : Number of file accessed localy by a user
    • Remote access : Number of file accessed remotely.
    • Total access : Total number of file accessed.
    • Ratio local access : local access / (local access + remote access)
  • Transfer mode plot (transfer VS remote IO):
    • Transfer : Number of transfer executed by the user
    • Remote io : Number of read done by the user using remote IO
    • Total : Total number of operation
    • Ratio remote IO: remote io / (transfer + remote io)
Note that only 1 plot is displayed at the time depending on which ranking attribute you want (ie. orderby option)

Importing data from the ML repository

To be completed by Sergey

Federation-level monitoring with ORACLE backend

One DB schema is created per federation. Currently one is for the CMS federation, another is for ATLAS one. Links to the UI to be provided

Federation-level monitoring with Hadoop/HBase backend

Development state

Services deployment

Topology needs to be fixed (map: domain name -> (site, country)?).

Note that dashboard48 is slow - e.g., its network bandwith is limited to 30Mbit/sec, in which case the process that writes/reads the net work uses 100% vm CPU. Problem cause is unknown.

Hadoop deployments

  • lxfssm4401 - one from IT stuff.
  • dashboard48 - a gateway to lxfssm4401.
  • dashboard07 - owned by dashboard; unstable. Who wants to have admin access?
    • dashboard07, 08, 64, 65
Also see https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=214885 for deployments section.


  1. DAO - possibly implement a new one to use HDFS
  2. Summaries processor:
    • Evaluate Hive and Pig;
      1. Possibility to use Python UDFs to deserialize JSON columns?
      2. If HDFS is adopted - reimplement Hive and Pig to use it;
    • Processing in python: requires DAO to be configured, but uses it intrusively/breaks incaplusation (accesses protected/private fields like HBase connection directly)
      • Probably at least a check is needed to prevent its usage with Oracle DAO.
      • Most definitely no need to adapt for Oracle.
  3. Overall evaluation - performance, accuracy, other differences with Oracle backend.
    • Can we measure reliability/robustness?
  4. Try out/evaluate pydoop, pypig.
Useful links:

Configuring HBase DAO

Is in /opt/dashboard/etc/dashboard-dao/dashboard-dao.cfg (or the respective DAO config for service).

One has to configure table names with, possibly, table prefixes (since there's no namespaces nor 'use db' stuff like in SQL), column family name and formats to store data into tables. The available formatters are: $ 'flat': concatenates all fields in specific order, separated by ':', and puts in ':c" column. $ 'column': stores all fields in different columns, with column being concatenation of prespecified column family and a field name. $ 'json': stored JSON-serialized message in ':j' column.

Reference (current state, may change):

Section param name default value meaning
[hbase-thrift] hostname must specify A host to connect to (e.g., hostname=lxfssm4401.cern.ch )
  port must specify A port (usually 9090)
  table.prefix None ( uses no prefix ) A table prefix to use as a namespace. With prefix being set to "dashboard", all tables will be prefixed with "dashboard_"
[transfers-dao-hbase] messages.table.name must specify The table name to store messages
  messages.formatter.name must specify Formatter for messages
  summaries.tables.prefix must specify Prefix for summaries table, in addition to table.prefix specified in [hbase-thrift] section. The final tables will be, e.g., [_]_10m for 10 minutes summaries, etc
  summaries.formatter.name must specify Formatter for summaries
  column.family.name must specify Column family (can be arbitrary, should be as short as possible (a one char like 'd' is okay), but must match the table defined in HBase schema)



  • ALICE sites internal naming can be found from ALICE ldap server
ldapsearch -x -LLL -H ldap://aliendb06a.cern.ch:8389 -b ou=Sites,o=alice,dc=cern,dc=ch

However, ALICE internal naming may appear to be different from the custom WLCG/OSG sites naming.

Via clicking to XML one can get


Get file in the local directory:

wget -O osg_test "http://is.grid.iu.edu/cgi-bin/status.cgi?format=xml&grid_type=OSG"

paste -s -d ';' < osg_sites | awk -F \/_/g | sed s/\ osg_sites.txt

Parcer is attached: https://twiki.cern.ch/twiki/pub/LCG/XrootdMonitoring/parce_osg.csh

  • WLCG sites can be found in GOCDB:

Get file in the local directory (in tcsh):

wget --no-check-certificate -O serv_list "https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint"

cat serv_list | awk -F \/_/g | sed s/\ serv_list.txt

Parcer is attached: https://twiki.cern.ch/twiki/pub/LCG/XrootdMonitoring/parce_gocdb.csh

  • However, there are sites that are not-WLCG resources and they should be added in addition to the matrix: is LBL

LOEWE-CSC is called "HHLR_GU" on ALICE map is Madrid

  • Around 25% of IPs are internal and can not be easily linked through LCG/OSG or ALICE sites list
Some other tools need to be used for that cases.

  • Association of the IP to the Site
Script is attached: https://twiki.cern.ch/twiki/pub/LCG/XrootdMonitoring/find_site_new.csh

The script is based on the Linux command host to get the host name and association to the Site through the domain part.

Some issues:

- command host (or nslookup) has some timeout and there is some low percentage of failure due-to timeout. For this case one has to repeat command second time

- there are some rare case when hosts are not added to rDNS and in that case other tricks should be used

- French sites are the special case. For identification one has to use not only domain part (in2p3.fr) but also the first field of the hostname, f.e.

clr means IN2P3-LPC (example: clrwn71.in2p3.fr)

cc means IN2P3-CC (example: ccwsge1317.in2p3.fr)

mar : IN2P3-CPPM

lyo : IN2P3-IPNL

lpnhe, lal,*cpp* : different GRIF institutions


sbg : IN2P3-IRES

lapp : IN2P3-LAPP

lpsc : IN2P3-LPSC

Topic attachments
I Attachment History Action SizeSorted ascending Date Who Comment
Unknown file formatcsh parce_osg.csh r1 manage 0.5 K 2012-12-05 - 11:26 OlgaKodolova  
Unknown file formatcsh parce_gocdb.csh r1 manage 0.8 K 2012-12-05 - 11:26 OlgaKodolova  
Unknown file formatcsh find_site_new.csh r1 manage 3.8 K 2012-12-05 - 12:20 OlgaKodolova  
JPEGjpg DataFlowFedMon.jpg r1 manage 50.9 K 2012-10-11 - 09:12 AlexandreBeche  
JPEGjpg XrootdFedDataFlow.jpg r1 manage 58.2 K 2012-10-11 - 09:12 AlexandreBeche  
Edit | Attach | Watch | Print version | History: r15 < r14 < r13 < r12 < r11 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r15 - 2014-02-14 - AlexandreBeche
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    LCG All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2023 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback