XRootD Federation Monitoring
This page documents the XRootd Federation Monitoring project.
Motivations
- Create a Dashboard to monitor XRootd at the federation level.
Links
XRootD monitoring
The implementation of XRootD monitoring foresees three levels of hierarchy: site, federation, and global.
Monitoring at the site level is implemented in the framework of the Tier3 monitoring project and is currently under validation by the ATLAS pilot sites. More information can be found in the attached talk. Installation and documentation instructions can be found here:
https://svnweb.cern.ch/trac/t3mon/wiki/xRootdAndGangliaDetailed
XRootD has built-in monitoring functionality which enables reporting of the monitoring data via UDP.
There are currently two data flows enabled: the summary ("smry") one and the detailed one. The summary flow does not contain per-file information and therefore does not provide enough information about transferred files, users, etc. The summary flow is currently used by the MonALISA XRootD monitoring repository. In order to have a complete view, CMS (UCSD) developed a collector which reads the detailed flow and creates per-file reports. Initially these reports were sent via UDP. Recently, a new version of the UCSD collector was created in order to report to ActiveMQ. The UCSD collector is currently deployed one per federation (US ATLAS and US CMS), but the deployment policy can be flexible depending on the size of the federation and the volume of reported data. Reports sent to ActiveMQ are consumed by the WLCG Transfer Dashboard, the CMS popularity service and the new federation monitoring system. The latter should provide a complete view of all the information required for operating an XRootD federation and for federation-member site support teams. Requirements for this system are being collected from the ATLAS XRootD community (see the presentation of Rob Gardner in the attachment).
The overall data flow is presented in the schema below.
*
DataFlowFedMon.jpg:
Data flow for the federation-level monitor
*
XrootdFedDataFlow.jpg:
XRootD - prototype file-access report message format
Example of a message currently sent by xrdmon on a file-close event. It is shown in plain text for readability; JSON is already used for the ActiveMQ messages. Some comments are inserted into the message itself using the // syntax.
#begin
// 0. Unique record ID constructed from current time; required by Gratia
//
unique_id=xrd-1341778615736000
//
// 1. Information about file and data-transfer
//
file_lfn=/atlas/dq2/user/ilijav/HCtest/user.ilijav.HCtest.1/group.test.hc.NTUP_SMWZ.root
file_size=797152257
start_time=1341778586
end_time=1341778615
// These byte-counts are reported by the server in the close message when xrootd.monitor contains the 'files' flag.
// They can be rounded up if the number does not fit into the available number of bits.
read_bytes_at_close=797152257
write_bytes_at_close=0
// These are accumulated by the collector from individual read / write request
// traces that are only sent when xrootd.monitor contains the 'io' flag (which requires, or maybe implies, also 'files').
read_bytes=797152257
read_operations=190
read_min=234497
read_max=8388608
read_average=4195538.194737
read_sigma=418468.184710
read_single_bytes=797152257
read_single_operations=190
read_single_min=234497
read_single_max=8388608
read_single_average=4195538.194737
read_single_sigma=418468.184710
read_vector_bytes=0
read_vector_operations=0
read_vector_min=0
read_vector_max=0
read_vector_average=0.000000
read_vector_sigma=0.000000
read_vector_count_min=0
read_vector_count_max=0
read_vector_count_average=0.000000
read_vector_count_sigma=0.000000
write_bytes=0
write_operations=0
write_min=0
write_max=0
write_average=0.000000
write_sigma=0.000000
//
// 2. Information about user / session
//
user_dn=
user_vo=atlas
user_role=
user_fqan=usatlas
client_domain=ochep.ou.edu
client_host=tier2-01
server_username=/DC=org/DC=doegrids/OU=People/CN=John R. Hover 47116
app_info=
//
// 3. Information about server
//
server_domain=usatlas.bnl.gov
server_host=dcdoor12
#end
Notes:
- The message is composed of three "sections":
- file and file-access information;
- user / session / client information;
- information about the xrootd server that served the file / session.
- About byte-counts:
- read/write_bytes_at_close are reported by the server and are rounded up if the numbers are too big.
- The other read_, read_single_ and read_vector_ fields are summed up internally by the collector when io traces are enabled. The read_ fields without single or vector are just the summed-up values of the single and vector entries and are redundant (they are there because that was the first thing implemented).
- For read_vector_, one vector request is counted as one operation. Further, the read_vector_count_ fields give statistics about the number of file chunks asked for in a single operation.
- Configuring GSI authentication in xrootd is somewhat tricky, as there is no common rule for how security plugins map various elements into the XrdSecEntity structure that is then passed down to monitoring. We (USCMS) use grid-mapfiles on some sites and GUMS via the xrootd-lcmaps authentication plugin (and get consistent results for those cases). Depending on what ATLAS decides to use, another iteration on XrdSecGsi might be needed.
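As an illustration, the plain-text record shown above can be parsed with a short Python sketch (the key=value and '//' comment conventions are taken from the sample message; the numeric conversion is an assumption for convenience):

```python
def parse_xrdmon_record(text):
    """Parse one plain-text xrdmon file-close record into a dict.

    Lines between '#begin' and '#end'; '//' lines are comments;
    everything else is 'key=value'. Numeric values are converted.
    """
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("//", "#begin", "#end")):
            continue
        key, _, value = line.partition("=")
        # Try int, then float, else keep the raw string (may be empty).
        for cast in (int, float):
            try:
                record[key] = cast(value)
                break
            except ValueError:
                pass
        else:
            record[key] = value
    return record

# A shortened version of the sample record above.
sample = """#begin
unique_id=xrd-1341778615736000
file_size=797152257
read_average=4195538.194737
user_vo=atlas
user_dn=
#end"""
rec = parse_xrdmon_record(sample)
```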
Additional questions:
- What is the difference between bytes_read and bytes_read_at_close?
- What happens if there is something wrong with file reading/transferring? Do we get the end-of-file-reading message at all?
It depends on how it happens, but Gled will always send at least some info.
a) The client session gets disconnected: the server will send the file-close message and all goes as normal.
b) The file/session close UDP message is lost: there are settings for an inactivity timeout for a given user; I think the default is 24 hours. Once this is reached, all the files get reported as closed. The fields that report server-side read/write amounts are zero in this case.
c) The server goes down / gets restarted: again, there is a timeout for no messages from a server. Additionally, servers can be configured to send periodic "server identification" messages. When these are in use, we can detect much faster that a server went down (we/USCMS send an ident every 5 minutes and consider a server to be down when three consecutive messages do not arrive; the time delta between idents is measured).
- In the native xrootd monitoring flow we have no information about failures. However, if we plot timed-out transfers/accesses we might be able to get an idea about the efficiency. Is this correct?
Matevz could add a status flag telling whether a proper close record was received. To a first approximation, if read_bytes_at_close or write_bytes_at_close is non-zero (i.e. a proper close record was received; see case b above, where both counters are zero after a timeout) the transfer is OK, otherwise something went wrong. This can be used to calculate an efficiency.
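This first approximation can be sketched in a few lines of Python (the zero-counter interpretation follows the timeout case in the previous answer and should be treated as an assumption):

```python
def transfer_ok(read_bytes_at_close, write_bytes_at_close):
    # Assumption from the discussion above: when the file-close UDP
    # message is lost and the collector times the session out, the
    # server-side byte counters are reported as zero, so a non-zero
    # counter indicates that a proper close record was received.
    return read_bytes_at_close > 0 or write_bytes_at_close > 0

def efficiency(transfers):
    """Fraction of transfers that look OK; `transfers` is a list of
    (read_bytes_at_close, write_bytes_at_close) tuples."""
    if not transfers:
        return 0.0
    ok = sum(1 for r, w in transfers if transfer_ok(r, w))
    return ok / len(transfers)
```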
Discussion about close to real time monitoring
- What is the suggestion for the format of a more real-time view?
Matevz suggests to look into
http://xrootd.t2.ucsd.edu:4243/?no_same_site
- Summary of the mail discussion
Close-to-real-time information should be in the form of time-based events (~ once per minute). What is interesting is the number of reads/writes, single and vector ones. This is contained in the file-close events, though it can happen that a file stays open for a long time, in which case a single close-file report is not enough. Seeks are useless and should not be included in the aggregated reports. What should be included in the time-based events is the file id and the bytes transferred. Matevz suggested to just send the "summed up" single & vector reads over the last time bin for files that have changed. Ilija agreed that this is what is needed, but the problem is that, according to Andy, this would require significant changes in xrootd and it is not the top priority for the xrootd development. However, the discussion is still ongoing.
Requirements and what is realistic to implement based on the current data flow
| requirement | scope | doable or not, based on what we can get from xrootd monitoring data |
| redirection statistics: fraction of time accesses are local, redirected within region, cloud, or global | relevant both for the WLCG monitor and the federation-level monitor | OK, given we have a proper topology in place |
| Authentication successes/failures | federation-level monitor | Currently this info is not available in the data reported to ActiveMQ, but it is available in the ML repository, therefore there is a way to retrieve it; it might be a part of the smry flow, to be checked |
| Number of files opened | federation-level monitor | Not possible to have it close to real time if we only have file-close reports. Should be possible if we also get reports when a file is opened; check with Matevz whether this kind of report can be added |
| Distinguish direct access versus copy | relevant either for popularity or for the federation-level monitor, not relevant for the Global Transfer Dashboard | Should provide the ability to understand whether a file was accessed directly from the remote storage, or instead a copy request had been issued and then the file was accessed at the local site. In principle, in order to compare direct vs copy, it should be enough to compare # read bytes / # transferred bytes, but # read bytes should not include local access to files which had been transferred beforehand. The first approximation would be to consider only remote reading |
| Distinguish local versus WAN | relevant both for the WLCG monitor and the federation-level monitor | OK, since we can make the distinction based on the server and client domains |
| Statistics for files actually used and mode of access | federation-level monitor | OK, already available in popularity; can be enabled in the federation-level monitor as well |
| User statistics for direct access versus copy | relevant either for popularity or for the federation-level monitor, not relevant for the Global Transfer Dashboard | similar to # of bytes, but calculated per user, or in terms of users |
| For brokerage, cost matrix | - | To make a decision from where in the federation it will be most efficient to get a file, the job broker will need to know the cost of each transfer. Currently we define the cost in the simplest way: the number of seconds spent to copy a well-defined amount of data from one federated site to another |
| ranking plots, sites by data | relevant both for the WLCG monitor and the federation-level monitor | OK |
| File lifetime distributions by site | - | Not clear whether it belongs here, where this info is supposed to come from, and how we know when a file was deleted |
| "active" data volume at a site, absolute and as a fraction of capacity, where an "active" file is one used in the last X weeks/months | federation-level monitor, or rather the popularity application | This requirement implies 1) keeping a long history of every file access (in order to define active files), which might not be a good idea for the federation-level monitor, and 2) knowledge of the storage capacity at every site; where should the latter come from? Not sure whether it belongs to federation monitoring; it might rather be the VO-level popularity application, which can get aggregated daily data from the federation-level monitor through an API. On the other hand, we can enable transparent navigation from the federation-level monitor to the popularity plots/tables |
| plot of file age at deletion (cleanup), and plot of avg file age at deletion by site | federation-level monitor, or rather the popularity application | Same as the previous one; looks like it belongs to the VO-level popularity application. Needs data about data cleaning at the site; where is it supposed to come from? |
| Site availability metrics and ranking plots | SSB | Though the natural place for this information is the SSB, the federation-level monitor should provide transparent navigation to the SSB plots, generating the required links from the federation monitor UI |
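The simple cost definition from the brokerage row above (seconds spent copying a well-defined amount of data between two federated sites) can be sketched as follows; the normalisation to GB and the averaging over recent transfers are assumptions:

```python
from collections import defaultdict

def transfer_cost(bytes_copied, start_time, end_time):
    """Cost of a transfer as defined above: seconds per well-defined
    amount of data (here normalised to seconds per GB)."""
    GB = 1e9
    duration = end_time - start_time
    return duration / (bytes_copied / GB)

def build_cost_matrix(transfers):
    """Cost matrix keyed by (source_site, destination_site), averaged
    over measured transfers given as (src, dst, bytes, t0, t1) tuples."""
    sums = defaultdict(lambda: [0.0, 0])
    for src, dst, nbytes, t0, t1 in transfers:
        cell = sums[(src, dst)]
        cell[0] += transfer_cost(nbytes, t0, t1)
        cell[1] += 1
    return {pair: total / n for pair, (total, n) in sums.items()}
```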
Integration of ALICE xrootd traffic into WLCG transfer dashboard
XRootD monitoring information from ALICE (namely, active-transfer information) reaches the Dashboard in 4 steps:
- A specially developed filter for the MonALISA service extracts transfer-related messages and puts them into a local message queue
- An aggregator program reconstructs the transfer messages from their parts, joins them into bigger messages and sends them to another local queue
- The aggregated messages are sent to the message broker using stompclt running in daemon mode
- On the Dashboard's side, a consumer receives the messages and extracts the transfer information from them
What is being monitored
The monitored objects are transfers between XRootD servers and clients. One server can have several transfers being performed by one client, so all transfers appear in the monitoring information separately.
A transfer's statistics include the following information:
- timestamp (when the information was sent to ML)
- server host
- client IP (and FQDN, if the aggregator was able to resolve it)
- megabytes read (during the last minute)
- megabytes written (during the last minute)
- MonALISA site name the server belongs to
Please note that all the transfer information is provided for an interval of one minute.
Proposed message structure
The monitoring information is encapsulated in a transport message in JSON format having head and body sections (to read more about the abstraction used, please see the documentation of stompclt; for the message structure see the Message::Message object description).
The head section contains the message id, timestamp and monitoring host name.
The body section contains the transfer monitoring information itself. It is JSON encoded to a string with json.dumps(); in the consumer it should be restored with json.loads().
Contents of the body section:
"transfers": [
{
"message_id": "550e8400-e29b-41d4-a716-446655440000", // UUID string
"timestamp": "123456789",
"server_host": "host.example.com",
"site_name_ml": "XRD_SITE",
"client_ip": "1.2.3.4",
"client_fqdn": "client.phys.net", // optional, could be missing if hostname was not resolved
"client_site_alice": "", // optional, for future use
"client_gridsite": "", // optional, for future use
"read_mb": "10.543",
"written_mb": "7.891",
},
{
.....
}
]
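A minimal consumer-side sketch, assuming the body arrives as the JSON string produced by json.dumps() as described above (field names are taken from the sample; note that the numeric fields are transported as strings):

```python
import json

def decode_transfer_message(message_body):
    """Restore the JSON-encoded body string and return the list of
    per-transfer dicts."""
    body = json.loads(message_body)
    return body["transfers"]

# Emulate a body string as the aggregator would produce it.
body_str = json.dumps({"transfers": [{
    "server_host": "host.example.com",
    "client_ip": "1.2.3.4",
    "read_mb": "10.543",
    "written_mb": "7.891",
}]})
transfers = decode_transfer_message(body_str)
# Numeric fields arrive as strings in the proposed format.
read_mb = float(transfers[0]["read_mb"])
```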
Messaging topics used
Messages are currently being sent to the test broker dashb-mb-test (stomp://dashb-mb-test.cern.ch:6163), to the topic /topic/xrootd.ALICE.
Federation-level monitoring
Federation-level monitoring is currently being implemented for two persistency backends: Oracle and Hadoop. Oracle does not match the per-federation deployment model, but it is used for quick prototyping of the functionality (understanding which metrics are required, aggregation levels, etc.) and of the user interface.
Federation-level monitoring should contain :
- transfer throughput metrics (similar to ones in the WLCG Transfer Dashboard)
- metrics which are currently available in the MonAlisa repository (so that users would have everything in one single place)
- user access information (popularity-like)
User access information
A new tab has been added to the dashboard, allowing to correlate many different pieces of information and rank them depending on a parameter. This information is summarized in 3 different plots, presented as follows.
- General plot:
- Total read : total bytes read by a user during a given period.
- Single read : bytes read by a user using the normal read function.
- Vector read : bytes read by a user using the vector read function.
- Fraction read : average fraction of a file read by the user.
- Ratio vector read : vector read / (single read + vector read).
- Access type plot (local vs remote access):
- Local access : number of files accessed locally by a user.
- Remote access : number of files accessed remotely.
- Total access : total number of files accessed.
- Ratio local access : local access / (local access + remote access).
- Transfer mode plot (transfer vs remote IO):
- Transfer : number of transfers executed by the user.
- Remote io : number of reads done by the user using remote IO.
- Total : total number of operations.
- Ratio remote IO : remote io / (transfer + remote io).
Note that only 1 plot is displayed at a time, depending on which ranking attribute you want (i.e. the orderby option).
Importing data from the ML repository
To be completed by Sergey
Federation-level monitoring with ORACLE backend
One DB schema is created per federation; currently one is for the CMS federation and another for the ATLAS one. Links to the UI are to be provided.
Federation-level monitoring with Hadoop/HBase backend
Development state
- DAO:
- HBase: mostly done.
- Hadoop HDFS DAO:
Services deployment
- At dashboard48: collector, python summaries processor, web ui with summaries export as JSON.
The topology needs to be fixed (map: domain name -> (site, country)?).
Note that dashboard48 is slow - e.g., its network bandwidth is limited to 30 Mbit/sec, in which case the process that writes/reads the network uses 100% of a VM CPU. The cause of the problem is unknown.
Hadoop deployments
- lxfssm4401 - one from IT stuff.
- dashboard48 - a gateway to lxfssm4401.
- dashboard07 - owned by dashboard; unstable. Who wants to have admin access?
Also see
https://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=214885
for deployments section.
- DAO - possibly implement a new one to use HDFS
- Summaries processor:
- Evaluate Hive and Pig;
- Possibility to use Python UDFs to deserialize JSON columns?
- If HDFS is adopted - reimplement Hive and Pig to use it;
- Processing in Python: requires the DAO to be configured, but uses it intrusively / breaks encapsulation (accesses protected/private fields like the HBase connection directly)
- Probably at least a check is needed to prevent its usage with Oracle DAO.
- Most definitely no need to adapt for Oracle.
- Overall evaluation - performance, accuracy, other differences with Oracle backend.
- Can we measure reliability/robustness?
- Try out/evaluate pydoop, pypig.
Useful links:
Configuring HBase DAO
The configuration is in /opt/dashboard/etc/dashboard-dao/dashboard-dao.cfg (or the respective DAO config for the service).
One has to configure the table names (possibly with table prefixes, since there are no namespaces nor 'use db' statements like in SQL), the column family name, and the formats used to store data into the tables. The available formatters are:
- 'flat': concatenates all fields in a specific order, separated by ':', and puts the result in the ':c' column.
- 'column': stores all fields in different columns, each column name being the concatenation of the prespecified column family and the field name.
- 'json': stores the JSON-serialized message in the ':j' column.
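A rough sketch of what the three formatters produce (the ':c'/':j' column names and the family:field convention are as described above; the exact field order for 'flat' is prespecified in the real implementation and is only assumed here):

```python
import json

def format_flat(fields, order):
    # 'flat': concatenate fields in a fixed order, ':'-separated,
    # into the ':c' column.
    return {":c": ":".join(str(fields[k]) for k in order)}

def format_column(fields, column_family):
    # 'column': one HBase column per field, named '<family>:<field>'.
    return {"%s:%s" % (column_family, k): str(v) for k, v in fields.items()}

def format_json(fields):
    # 'json': whole message JSON-serialized into the ':j' column.
    return {":j": json.dumps(fields, sort_keys=True)}
```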
Reference (current state, may change):
| Section | param name | default value | meaning |
| [hbase-thrift] | hostname | must specify | The host to connect to (e.g., hostname=lxfssm4401.cern.ch) |
| | port | must specify | The port (usually 9090) |
| | table.prefix | None (uses no prefix) | A table prefix to use as a namespace. With the prefix set to "dashboard", all tables will be prefixed with "dashboard_" |
| [transfers-dao-hbase] | messages.table.name | must specify | The table name to store messages |
| | messages.formatter.name | must specify | Formatter for messages |
| | summaries.tables.prefix | must specify | Prefix for the summaries tables, in addition to the table.prefix specified in the [hbase-thrift] section. The final tables will be, e.g., [_]_10m for 10-minute summaries, etc. |
| | summaries.formatter.name | must specify | Formatter for summaries |
| | column.family.name | must specify | Column family (can be arbitrary and should be as short as possible - one char like 'd' is okay - but must match the table defined in the HBase schema) |
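Putting the parameters together, a dashboard-dao.cfg fragment might look like the following sketch (the host, table names and formatter choices are purely illustrative, not the actual deployed configuration):

```ini
[hbase-thrift]
hostname = lxfssm4401.cern.ch
port = 9090
table.prefix = dashboard

[transfers-dao-hbase]
messages.table.name = transfers_messages
messages.formatter.name = json
summaries.tables.prefix = transfers_summaries
summaries.formatter.name = column
column.family.name = d
```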
Topology
ALICE
- ALICE sites internal naming can be found from ALICE ldap server
ldapsearch -x -LLL -H ldap://aliendb06a.cern.ch:8389 -b ou=Sites,o=alice,dc=cern,dc=ch
However, ALICE internal naming may differ from the custom WLCG/OSG site naming.
- OSG sites can be found in the OSG information system; by clicking on XML one can get:
http://is.grid.iu.edu/cgi-bin/status.cgi?format=xml&grid_type=OSG
Get file in the local directory:
wget -O osg_test "http://is.grid.iu.edu/cgi-bin/status.cgi?format=xml&grid_type=OSG"
paste -s -d ';' < osg_sites | awk -F \
/_/g | sed s/\ osg_sites.txt
The parser script is attached: https://twiki.cern.ch/twiki/pub/LCG/XrootdMonitoring/parce_osg.csh
- WLCG sites can be found in GOCDB:
https://wiki.egi.eu/wiki/GOCDB/PI/Technical_Documentation
Get file in the local directory (in tcsh):
wget --no-check-certificate -O serv_list "https://goc.egi.eu/gocdbpi/public/?method=get_service_endpoint"
cat serv_list | awk -F \/_/g | sed s/\ serv_list.txt
The parser script is attached: https://twiki.cern.ch/twiki/pub/LCG/XrootdMonitoring/parce_gocdb.csh
- However, there are sites that are non-WLCG resources, and they should additionally be added to the matrix:
128.55.36.0/23 is LBL
LOEWE-CSC is called "HHLR_GU" on the ALICE map
192.101.166.0/255 is Madrid
- Around 25% of the IPs are internal and cannot be easily linked through the LCG/OSG or ALICE site lists. Some other tools need to be used for those cases.
- Association of an IP to a site
A script is attached: https://twiki.cern.ch/twiki/pub/LCG/XrootdMonitoring/find_site_new.csh
The script is based on the Linux command host to get the host name; the association to the site is done through the domain part.
Some issues:
- The command host (or nslookup) has a timeout, and there is a low percentage of failures due to the timeout; in this case one has to repeat the command a second time
- There are some rare cases when hosts are not added to the rDNS; in those cases other tricks have to be used
- French sites are a special case. For identification one has to use not only the domain part (in2p3.fr) but also the first field of the hostname, e.g.:
clr means IN2P3-LPC (example: clrwn71.in2p3.fr)
cc means IN2P3-CC (example: ccwsge1317.in2p3.fr)
mar means IN2P3-CPPM
lyo means IN2P3-IPNL
lpnhe, lal, *cpp* mean different GRIF institutions
nan means IN2P3-SUBATECH
sbg means IN2P3-IRES
lapp means IN2P3-LAPP
lpsc means IN2P3-LPSC
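The prefix rules above can be encoded as a small lookup table; a sketch (GRIF markers are matched as substrings and the other prefixes at the start of the first hostname label, per the list above):

```python
IN2P3_PREFIXES = {
    "clr": "IN2P3-LPC",
    "cc": "IN2P3-CC",
    "mar": "IN2P3-CPPM",
    "lyo": "IN2P3-IPNL",
    "nan": "IN2P3-SUBATECH",
    "sbg": "IN2P3-IRES",
    "lapp": "IN2P3-LAPP",
    "lpsc": "IN2P3-LPSC",
}
GRIF_MARKERS = ("lpnhe", "lal", "cpp")

def in2p3_site(hostname):
    """Map an in2p3.fr hostname to its site using the first label."""
    if not hostname.endswith(".in2p3.fr"):
        return None
    label = hostname.split(".", 1)[0]
    if any(m in label for m in GRIF_MARKERS):
        return "GRIF"
    # Longest prefix first, so 'lapp'/'lpsc' win over shorter matches.
    for prefix in sorted(IN2P3_PREFIXES, key=len, reverse=True):
        if label.startswith(prefix):
            return IN2P3_PREFIXES[prefix]
    return None
```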