WLCG Data Transfer Monitoring
Architecture
- XRootD Data Transfer Monitoring architecture (diagram)
Resources
Message Brokers
- broker : dashb-mb.cern.ch : {mb103, mb104, mb203, mb202, mb108}.cern.ch
Weekly meeting
Topology
FAX (ATLAS) Instance
- Production :
- machine : dashb-ai-520.cern.ch / dashb-ai-521.cern.ch
- alias : dashb-atlas-xrootd-transfers.cern.ch
- broker : dashb-mb
- broker topics : xrootd.atlas.eos / xrootd.atlas.fax.eu / xrootd.fax.us
- broker queues : Consumer.dashb-atlas.
- broker auth : dboard certificate (/home/dboard/.security/*)
- database info : LCGR (DSN = lcg_xrootd_atlas)
- database account : RAW data (atlas_xrd_mon<_r,_w>) / STATS data (atlas_xrootd_dashboard<_r,_w>)
- database password : Stored in database_connection_strings.cfg (to be created and stored in a secure area on AFS)
- puppet hostgroup : dashboard/web_server/xrootd/atlas/production
- GLED Collectors:
- FAX EU: alias:port ATLAS-FAX-EU-COLLECTOR:9330 dashb-ai-527
- EOS: alias:port ATLAS-XRDMON-COLLECTOR:9331 dashb-ai-528
- FAX US: maintained by Ilija Vukotic in the US. Note that this collector still serves many EU sites
- Integration :
- machine : dashb-ai-554.cern.ch
- alias : -
- broker : dashb-mb
- broker topics : xrootd.atlas.eos / xrootd.atlas.fax.eu / xrootd.fax.us
- broker queues : - (!! Direct consumption from the topic !!)
- broker auth : dboard certificate (/home/dboard/.security/*)
- database info : INT11R
- database account : RAW data (atlas_xrd_mon<_r,_w>) / STATS data (atlas_xrootd_dashboard<_r,_w>)
- database password : Stored in database_connection_strings.cfg (to be created and stored in a secure area on AFS)
- puppet hostgroup : dashboard/web_server/xrootd/atlas/integration
AAA (CMS) Instance
- Production
- machine : dashb-ai-522.cern.ch / dashb-ai-523.cern.ch
- alias : dashb-cms-xrootd-transfers.cern.ch
- broker : dashb-mb
- broker topics : xrootd.cms.eos / xrootd.cms.aaa
- broker queues : Consumer.dashb-cms.
- broker auth : dboard certificate (/home/dboard/.security/*)
- database info : LCGR (DSN = lcg_xrootd_cms)
- database account : RAW data (cms_xrd_mon<_r,_w>) / STATS data (cms_xrootd_dashboard<_r,_w>)
- database password : Stored in database_connection_strings.cfg (to be created and stored in a secure area on AFS)
- puppet hostgroup : dashboard/web_server/xrootd/cms/production
- GLED Collectors:
- EOS: alias:port CMS-XRDMON-COLLECTOR:9330 dashb-ai-529
- AAA US: maintained by Matevz Tadel in the US.
- AAA EU: alias:port CMS-AAA-EU-COLLECTOR:9330 dashb-ai-652
- Integration :
- machine : dashb-ai-553.cern.ch
- alias : -
- broker : dashb-mb
- broker topics : xrootd.cms.eos / xrootd.cms.aaa
- broker queues : - (!! Direct consumption from the topic !!)
- broker auth : dboard certificate (/home/dboard/.security/*)
- database info : INT11R
- database account : RAW data (cms_xrd_mon<_r,_w>) / STATS data (cms_xrootd_dashboard<_r,_w>)
- database password : Stored in database_connection_strings.cfg (to be created and stored in a secure area on AFS)
- puppet hostgroup : dashboard/web_server/xrootd/cms/integration
CMS EOS Data Popularity
The production workflow for the CMS EOS data collection has a twin workflow dedicated to the data collection for CMS EOS Data Popularity.
XRootD monitoring data published by the GLED collector to the AMQ topic xrootd.cms.eos are consumed by stompclt agents (broker queue : /queue/Consumer.popularity.xrootd.cms.eos), written into a local disk queue on the machine dashb-ai-530.cern.ch, and then inserted from the local disk queue into a DB schema in INT2R.
CMS Data Popularity services currently access metrics from that schema in INT2R.
The historical reason for this second workflow goes back to the time when CMS Popularity and the XRootD transfer dashboard were still two separate R&D projects. Since then much effort has been put into merging the two workflows, including a restructuring of the DB schemas used by both projects and the migration of all the historical data from INT2R to LCGR. In Q3 2014 the merge will be accomplished by migrating the popularity metrics to a new account in LCGR (CMS_XROOTD_POP) that will have access to the raw data stored in CMS_XRD_MON. At that point the twin workflow will be decommissioned.
Until the migration is fully accomplished, the collection workflow for CMS EOS Data Popularity needs to be operated in order to guarantee that XRootD monitoring data are also collected into the INT2R DB.
The workflow is based on two running services and related cron agents that verify their status. The two services are (I) simplevisor, which manages the stompclt agents, and (II) the dashboard collector, which inserts data into the DB.
The machine dashb-ai-530.cern.ch has 8 GB RAM and 25 GB HD, and currently hosts only this workflow. There is no indication of a load on this machine large enough to affect the workflow performance:
http://lemonweb.cern.ch/lemon-web/info.php?time=2.2&offset=0&entity=dashb-ai-530&detailed=yes
Based on the experience acquired in the past months, the system is able to automatically recover from most of the usual operational issues.
- Details
- machine : dashb-ai-530.cern.ch
- broker topic : xrootd.cms.eos
- broker queue : /queue/Consumer.popularity.xrootd.cms.eos
- broker auth : dboard certificate (/home/dboard/.security/*)
- software: /opt/dashboard
- cron:
- /etc/cron.d/dashboard_config checks/restarts the dashboard collector to db
- /etc/cron.d/dashboard_simplevisor checks the simplevisor that handles stompclt, and sends a notification if the local disk queue contains more than 100k messages (see the queue-check sketch after this list)
- agents:
- Collector: /opt/dashboard/cron/dashbCollectors.sh
- Simplevisor: /opt/dashboard/cron/dashbSimplevisor.sh
- logs:
- Collector: /opt/dashboard/var/log/dashb-xrootd_test.log
- Simplevisor: /opt/dashboard/var/log/consumer-simplevisor.log
- database info : int2r
- database account/password : stored in /opt/dashboard/etc/dashboard-dao/dashboard-dao.cfg
- Notifications: DASHB_NOTIFICATION=cms-popdb-alarms@cern.ch,dashb-aaa-alarms@cern.ch
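The local disk queue written by stompclt follows the directory-queue (dirq) layout used by the CERN messaging tools, so its backlog can be checked programmatically. Below is a minimal sketch, assuming the python-dirq package is available; the queue path is a hypothetical example, the real one is the outgoing-queue path in the stompclt configuration on dashb-ai-530.cern.ch.

# Sketch: count pending messages in the stompclt local disk queue.
# The queue path below is a placeholder; take the real path from the
# "outgoing-queue" setting of the stompclt configuration.
from dirq.QueueSimple import QueueSimple

QUEUE_PATH = "/opt/dashboard/var/messages/cms_eos"  # hypothetical path
THRESHOLD = 100000  # the cron check notifies above 100k messages

queue = QueueSimple(QUEUE_PATH)
pending = queue.count()
print("pending messages: %d" % pending)
if pending > THRESHOLD:
    print("backlog above threshold, a notification would be sent")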
LHCb Instance
- Integration :
- broker : dashb-test-mb
- broker topics : xrootd.lhcb.eos
- broker queues : Consumer.dashb-lhcb-int.xrootd.lhcb.eos
- broker auth : dboard certificate (/home/dboard/.security/*)
- database info : ?
- database account : ?
- database password : ?
- puppet hostgroup : ?
- GLED Collectors:
- EOS: alias:port LHCB-XRDMON-COLLECTOR:9330 dashb-ai-563
FTS
- Production :
- machine : dashb-ai-578.cern.ch / dashb-ai-579.cern.ch
- alias : dashb-fts-transfers.cern.ch
- broker : dashb-mb
- broker topics : transfer.fts_monitoring_start, transfer.fts_monitoring_complete, transfer.fts_monitoring_state
- broker queues : Consumer.dashb-fts.
- broker auth : dboard certificate (/home/dboard/.security/*)
- database info : LCGR (DSN = lcg_dashboard_tfr)
- database account : lcg_dashboard_tfr<_r,_w>
- database password : Stored in database_connection_strings.cfg (to be created and stored in a secure area on AFS)
- puppet hostgroup : dashboard/web_server/fts/production
- Integration :
- machine : dashb-ai-552.cern.ch
- alias : -
- broker : dashb-mb + gridmsg201
- broker topics : transfer.fts_monitoring_start, transfer.fts_monitoring_complete, transfer.fts_monitoring_state
- broker queues : - (!! Direct consumption from the topic !!)
- broker auth : dboard certificate (/home/dboard/.security/*)
- database info : INT6R
- database account : lcg_transfers_test<_r,_w>
- database password : Stored in database_connection_strings.cfg (to be created and stored in a secure area on AFS)
- puppet hostgroup : dashboard/web_server/fts/integration
Software structure
GLED Collector
- Current installed version: gled-xrdmon-1.4.1-1.el6.x86_64
- Repository: wlcg repo http://linuxsoft.cern.ch/wlcg/sl6/
- Configuration: /etc/gled/collectors.cfg
- daemon: /sbin/service gled-xrdmon {status | start | stop | restart}
- logs: /var/log/gled/
Consumers
Based on stompclt and managed by simplevisor.
- Configuration (manual): /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg
- Example of consumer configuration:
<entry>
type = service
name = atlas_eos-gridmsg203.cern.ch
expected = running
start = /usr/bin/stompclt --incoming-broker-uri stomp+ssl://gridmsg203.cern.ch:6162 --conf /opt/dashboard/etc/dashboard-simplevisor/atlas_eos_consumer.cfg --pidfile /opt/dashboard/var/lock/gridmsg203.cern.ch-atlas_eos.pid --daemon
stop = /usr/bin/stompclt --pidfile /opt/dashboard/var/lock/gridmsg203.cern.ch-atlas_eos.pid --quit
status = /usr/bin/stompclt --pidfile /opt/dashboard/var/lock/gridmsg203.cern.ch-atlas_eos.pid --status
</entry>
- Sample of stompclt configuration /opt/dashboard/etc/dashboard-simplevisor/atlas_eos_consumer.cfg:
subscribe = "destination=/topic/xrootd.atlas.eos"
outgoing-queue = "path=/opt/dashboard/var/messages/atlas_eos"
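For illustration only, the same subscription can be sketched in Python with the stomp.py library; the production consumers use stompclt as configured above, not this code. Broker host, port and topic are taken from the example entry; the certificate file names are assumptions.

# Minimal stomp.py sketch of a consumer subscribing to the ATLAS EOS topic.
# Assumes stomp.py >= 6 (listener callbacks receive a single Frame object).
import stomp

class PrintingListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # A real consumer would append the message body to the local disk queue.
        print("received %d bytes" % len(frame.body))

conn = stomp.Connection([("gridmsg203.cern.ch", 6162)])
conn.set_ssl(for_hosts=[("gridmsg203.cern.ch", 6162)],
             cert_file="/home/dboard/.security/usercert.pem",  # assumed file name
             key_file="/home/dboard/.security/userkey.pem")    # assumed file name
conn.set_listener("printer", PrintingListener())
conn.connect(wait=True)
conn.subscribe(destination="/topic/xrootd.atlas.eos", id="1", ack="auto")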
Dashboard Collector
- Configuration (puppet): /opt/dashboard/etc/dashboard-simplevisor/services-simplevisor.cfg
- Stop/Start:
- su - dboard; /opt/dashboard/bin/dashb-agent-restart xrootd_cms
- Configuration: /opt/dashboard/etc/dashboard-service-config/
- Code: /opt/dashboard/lib/dashboard/collector/xrootd/XRootDCollector.py
Dashboard Web
Apache Httpd with mod_python
- Configuration: /etc/httpd/conf.d/dashboard.conf
- Code:
- /opt/dashboard/etc/dashboard-web/dashboard-actions_transfers.xml : URL-to-action mapper
- /opt/dashboard/lib/dashboard/http/actions/ : server-side actions
- /opt/dashboard/templates/ui/ : client-side JavaScript
Mailing Lists, Meetings, ?
Code Repository
GIT project
- git clone https://:@git.cern.ch/kerberos/cosmic
Structure (for XRootD monitoring)
| module name | rpm name |
| arda.dashboard.cli | dashboard-cli |
| arda.dashboard.common | dashboard-common |
| arda.dashboard.dao | dashboard-dao |
| arda.dashboard.dao-oracle | dashboard-dao-oracle |
| arda.dashboard.service-config | dashboard-service-config |
| arda.dashboard.transfers-topology | dashboard-transfers-topology |
| arda.dashboard.util-url | dashboard-util-url |
| arda.dashboard.web | dashboard-web |
| arda.dashboard.xbrowse | dashboard-xbrowse |
| arda.dashboard.xrdmon-collector | dashboard-xrdmon-collector |
| arda.dashboard.xrootd_transfers | dashboard-xrootd_transfers |
For FTS monitoring
| module name | rpm name |
| arda.dashboard.cli | dashboard-cli |
| arda.dashboard.common | dashboard-common |
| arda.dashboard.dao | dashboard-dao |
| arda.dashboard.dao-oracle | dashboard-dao-oracle |
| arda.dashboard.service-config | dashboard-service-config |
| arda.dashboard.transfers-topology | dashboard-transfers-topology |
| arda.dashboard.util-url | dashboard-util-url |
| arda.dashboard.web | dashboard-web |
| arda.dashboard.xbrowse | dashboard-xbrowse |
| arda.dashboard.transfers | dashboard-transfers |
| arda.dashboard.transfers_collector | dashboard-transfers-collector |
Database structure
- XRootD Production Aggregation Schema diagram (FullSize). The schema shows the relationships between jobs, functions and tables; blue arrows indicate data reads, red arrows indicate data writes.
RAW Table
DESC T_RAW_FED
Name Null Type
------------------------- -------- --------------
UNIQUE_ID NOT NULL VARCHAR2(1000)
FILE_LFN VARCHAR2(1000)
FILE_SIZE NUMBER
CLIENT_DOMAIN VARCHAR2(1000)
CLIENT_HOST VARCHAR2(1000)
SERVER_DOMAIN VARCHAR2(1000)
SERVER_HOST VARCHAR2(1000)
READ_BYTES_AT_CLOSE NUMBER
READ_BYTES NUMBER
READ_OPERATIONS NUMBER
READ_AVERAGE NUMBER
READ_MIN NUMBER
READ_MAX NUMBER
READ_SIGMA NUMBER
READ_SINGLE_BYTES NUMBER
READ_SINGLE_OPERATIONS NUMBER
READ_SINGLE_AVERAGE NUMBER
READ_SINGLE_MIN NUMBER
READ_SINGLE_MAX NUMBER
READ_SINGLE_SIGMA NUMBER
READ_VECTOR_BYTES NUMBER
READ_VECTOR_OPERATIONS NUMBER
READ_VECTOR_AVERAGE NUMBER
READ_VECTOR_MIN NUMBER
READ_VECTOR_MAX NUMBER
READ_VECTOR_SIGMA NUMBER
READ_VECTOR_COUNT_AVERAGE NUMBER
READ_VECTOR_COUNT_MIN NUMBER
READ_VECTOR_COUNT_MAX NUMBER
READ_VECTOR_COUNT_SIGMA NUMBER
WRITE_BYTES_AT_CLOSE NUMBER
WRITE_BYTES NUMBER
WRITE_OPERATIONS NUMBER
WRITE_MIN NUMBER
WRITE_MAX NUMBER
WRITE_AVERAGE NUMBER
WRITE_SIGMA NUMBER
SERVER_USERNAME VARCHAR2(1000)
USER_DN VARCHAR2(1000)
USER_FQAN VARCHAR2(1000)
USER_ROLE VARCHAR2(1000)
USER_VO VARCHAR2(1000)
APP_INFO VARCHAR2(2048)
START_TIME NUMBER
END_TIME NUMBER
START_DATE DATE
END_DATE NOT NULL DATE
INSERT_DATE DATE
SERVER_SITE VARCHAR2(256)
QUEUE_FLAG VARCHAR2(8)
USER_PROTOCOL VARCHAR2(64)
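As an illustration of how the RAW table can be read, here is a minimal Python sketch using cx_Oracle. The DSN and read-only account match the ATLAS production entries above, while the password is a placeholder; in production the credentials come from database_connection_strings.cfg / dashboard-dao.cfg.

# Sketch: total bytes read per server domain over the last day, from T_RAW_FED.
import cx_Oracle

# Placeholder credentials; the real ones live in the dashboard configuration files.
conn = cx_Oracle.connect("atlas_xrd_mon_r", "CHANGE_ME", "lcg_xrootd_atlas")
cur = conn.cursor()
cur.execute("""
    SELECT server_domain, COUNT(*) AS n_files, SUM(read_bytes) AS bytes_read
      FROM t_raw_fed
     WHERE end_date > SYSDATE - 1
     GROUP BY server_domain
     ORDER BY bytes_read DESC
""")
for server_domain, n_files, bytes_read in cur:
    print(server_domain, n_files, bytes_read)
conn.close()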
STAT Tables
Several tables keep computed statistics rather than the raw data. These statistics appear on the UI in the form of graphs and plots.
The purpose of each table is summarised in the following list.
Main statistics tables:
- T_STATS_AVG - this table provides the information for the transfers-matrix and transfers-bins views. The data is computed per 10-minute bin, the smallest possible granularity (a toy illustration of this binning is given at the end of this section). This table does not have a "twin" table aggregating by day. Only this table contains data from T_RAW_EOS.
- T_USER_ACTIVITY - this table provides the information for the access-plot view. The data is computed per 10-minute bin (the smallest possible granularity). This table does not have a "twin" table aggregating by day.
- MV_SITE_STATS - this materialized view (not a regular table) provides the information for the site-statistics and site-history views.
- T_STATS_ACCESS_PATTERN - this table provides the information for the access-pattern and map views. It contains data in 10-minute bins. The twin of this table is T_STATS_ACCESS_PATTERN_A.
- T_STATS_ACCESS_PATTERN_A - the twin of T_STATS_ACCESS_PATTERN; it contains the information aggregated per 1-day bins.
Helper tables:
- T_AGENT - contains information about the last successful Oracle scheduled job run. Each new job run reads the corresponding field value from this table to know which time range to read from the RAW tables.
- T_MSG_RATE - contains the number of messages received from a particular domain in a particular 10-minute time bin.
- T_SITE - keeps track of all sites, showing whether they acted as clients or servers.
Outdated tables:
- T_STATS and its aggregated twin - the previous analogues of T_STATS_AVG and its aggregated version.
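To make the 10-minute binning concrete, here is a toy Python illustration of the kind of aggregation that fills a statistics table from raw rows. The real aggregation is performed by Oracle scheduled jobs, not by this code; the sample rows only reuse column names and domains seen elsewhere on this page.

# Toy illustration: aggregate raw file-access rows into 10-minute bins,
# keyed by (bin start, client_domain, server_domain).
from collections import defaultdict
from datetime import datetime, timedelta

def bin_start(ts, minutes=10):
    # Truncate a timestamp to the start of its 10-minute bin.
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)

# Rows shaped like (end_date, client_domain, server_domain, read_bytes) from T_RAW_FED.
rows = [
    (datetime(2014, 5, 19, 10, 3), "lal.in2p3.fr", "in2p3.fr", 1200),
    (datetime(2014, 5, 19, 10, 7), "lal.in2p3.fr", "in2p3.fr", 800),
    (datetime(2014, 5, 19, 10, 14), "ifca.es", "in2p3.fr", 500),
]

stats = defaultdict(lambda: {"read_bytes": 0, "files": 0})
for end_date, client_domain, server_domain, read_bytes in rows:
    key = (bin_start(end_date), client_domain, server_domain)
    stats[key]["read_bytes"] += read_bytes
    stats[key]["files"] += 1

for key, agg in sorted(stats.items()):
    print(key, agg)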
Procedure to fix topology resolution on user request
Today the topology is described in JSON files deployed through the rpm dashboard-transfers-topology (common to XRootD and FTS).
| file | function | source |
| /opt/dashboard/var/data/site_topology.json | Translate domain into site | RPM |
| /opt/dashboard/var/data/wlcg_rebus_topology.json | Enhance site with country | Rebus |
| /opt/dashboard/var/data/wlcg_static_topology.json | Enhance site with country | RPM |
| /opt/dashboard/var/data/vo_feed_.xml | Translate site name to VO site name | VO feeds |
| /opt/dashboard/var/data/static_vo_feed_.json | Translate site name to VO site name | RPM |
* Case 1: Clear user request:
"Would it be possible to add the nat.nd.edu accesses to T3_US_NotreDame?"
In this case, just modify the site_topology.json and deploy the new RPM.
* Case 2: Unclear user request: "My site does not appear in the dashboard"
1. Go on the integration machine and modify the file /opt/dashboard/lib/dashboard/http/actions/transfers/ui/topology.py
At the end of the file, modify the line:
#r[c + rule['to_key']] = mapping.get(r[c + rule['from_key']], default).replace(" ", "-")
to
r[c + rule['to_key']] = mapping.get(r[c + rule['from_key']], r[c + rule['to_key']]).replace(" ", "-")
This modification means: do not display unresolved sites as n/a, but use their domain name instead.
2. Go to the matrix view with site grouping (i.e. the default) and look for all sites that look like a domain.
3. Try to find the corresponding site using GOCDB or OIM topology.
4. Modify the site_topology.json and deploy the new RPM.
On topology resolution
Database tables
Samples from RAW tables:
- select distinct server_host, server_domain from t_raw_fed where end_date > '19-MAY-2014' and server_domain like '%in2p3%';
| ... | server_host | server_domain | ... |
| ... | sbgse23 | in2p3.fr | ... |
| ... | sbgse20 | in2p3.fr | ... |
| ... | polgrfs27 | in2p3.fr | ... |
| ... | ccxrpli001 | in2p3.fr | ... |
- select distinct client_host, client_domain from t_raw_fed where end_date > '19-MAY-2014' and client_domain like '%in2p3%';
| ... | client_host | client_domain | ... |
| ... | sbgwn6 | in2p3.fr | ... |
| ... | grid151 | lal.in2p3.fr | ... |
| ... | grid128 | lal.in2p3.fr | ... |
| ... | ccwsge0061 | in2p3.fr | ... |
Samples from STATS tables:
| ... | src_domain | dst_domain | ... |
| ... | in2p3.fr | in2p3.fr | ... |
| ... | in2p3.fr | lal.in2p3.fr | ... |
| ... | ifca.es | lal.in2p3.fr | ... |
site_topology.json
- cat /opt/dashboard/var/data/site_topology.json | grep in2p3
{"from_value": "in2p3.fr", "to_value": "IN2P3-CC"},
{"from_value": "lal.in2p3.fr", "to_value": "GRIF"},
{"from_value": "llrcream.in2p3.fr", "to_value": "GRIF"},
{"from_value": "lpnce.in2p3.fr", "to_value": "GRIF"},
{"from_value": "lpnhe-cream.in2p3.fr", "to_value": "GRIF"},
{"from_value": "polgrid4.in2p3.fr", "to_value": "GRIF"},
From CMS Gled US
UCSD CMS::IN2P3::XrdReport ccxrdpli001.in2p3.fr site T1_FR_CCIN2P3 20 May 2014 05:34
UCSD CMS::IN2P3::XrdReport llrpp01.in2p3.fr site T2_FR_GRIF_LLR 20 May 2014 05:34
UCSD CMS::IN2P3::XrdReport llrpp02.in2p3.fr site T2_FR_GRIF_LLR 20 May 2014 05:33
UCSD CMS::IN2P3::XrdReport llrpp03.in2p3.fr site T2_FR_GRIF_LLR 20 May 2014 05:32
Constraints
- The XRootD dashboard must be able to return data with VO site names AND GOCDB names (e.g. T2_FR_GRIF_LLR and GRIF, respectively).
T_TOPOLOGY table
On the ATLAS INTEGRATION owner schema.
It contains:
| PATTERN | VO_NAME |
| lapp[a-z0-9-]*\.in2p3\.fr | IN2P3-LAPP |
| lpn[a-z0-9-]*\.in2p3\.fr | GRIF-LPNHE |
| \.lal\.in2p3\.fr | GRIF-LAL |
| lpsc[a-z0-9-]*\.in2p3\.fr | IN2P3-LPSC |
| mar[a-z0-9-]*\.in2p3\.fr | CPPM |
| \.roma1\.infn\.it | INFN_Roma1 |
| \.lnf\.infn\.it | INFN_Frascati |
| \.na\.infn\.it | INFN_Napoli |
| \.mi\.infn\.it | INFN_MILANO_ATLASC |
| \.cnaf\.infn\.it | INFN-T1 |
| grid-lab105\.desy\.de | GRID-LAB |
| \.desy\.de | DESY-HH |
| \.cyf-kr\.edu\.pl | CYFRONET-LCG2 |
| \.poznan\.pl | PSNC |
| \.esc\.qmul | UKI-LT2-QMUL |
| \.ecdf\.ed\.ac\.uk | UKI-SCOTGRID-ECDF |
| \.gla\.scotgrid\.ac\.uk | UKI-SCOTGRID-GLASGOW |
| \.rl\.ac\.uk | RAL-LCG2 |
| \.pp\.rl\.ac\.uk | UKI-SOUTHGRID-RALPP |
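For illustration, a small Python sketch of how such regular-expression patterns can resolve a server or client host to a VO site name; the actual lookup happens inside the database workflow, and only two patterns from the table above are reproduced here.

# Sketch: resolve a host name to a VO site name using T_TOPOLOGY-style patterns.
import re

PATTERNS = [
    (r"lpn[a-z0-9-]*\.in2p3\.fr", "GRIF-LPNHE"),
    (r"\.lal\.in2p3\.fr", "GRIF-LAL"),
]

def resolve_host(host):
    # Return the VO site name of the first matching pattern, or None if unresolved.
    for pattern, vo_name in PATTERNS:
        if re.search(pattern, host):
            return vo_name
    return None

print(resolve_host("grid151.lal.in2p3.fr"))  # -> GRIF-LAL
print(resolve_host("lpnhe-cream.in2p3.fr"))  # -> GRIF-LPNHE

Note that pattern order matters when one pattern is a suffix of another (for example \.pp\.rl\.ac\.uk must be tried before \.rl\.ac\.uk).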
Fix options
* The strategy to improve topology resolution is described here:
https://its.cern.ch/jira/browse/WDT-1
CMS site reporting
Sites in the CMS federation are instrumented to report the site name (in CMS nomenclature) in the message. However:
- Not all sites are there yet
- Multi-VO sites are using the GOCDB site nomenclature (e.g. BUDAPEST)
Resources
Useful presentations are attached to the page
Old twiki
Deployment
- stable repo ai6
- testing repo ai6-testing
- integration machines are not associated with ai6-testing
- when new RPMs have to be installed on an integration machine, the ai6-testing repo has to be enabled manually via the yum command
- then run puppet
New release procedure
- Change code, commit on trunk...
- Go to the local directory of the cosmic module you'd like to build
- python setup.py release -c (or -m / -M; see release --help for details)
- In case of issues: git commit --amend, then remove the inserted line from RELEASE_NOTES and revert the release number in setup.cfg
- On the integration machine: yum update dashboard-xbrowse --enablerepo=ai6-testing
- Promote package to QA
- On production
Database connection strings
- puppet managed, per-application set of connection strings
Configuration
- /opt/dashboard/etc/dashboarddao.
- Topology:
- resolving host/domain to site at aggregation level
- getting the server site from the message, the client site with a regexp
- requires re-computation of all statistics (a long process, several weeks)
- requires changes at the web dashboard/UI level
- Multi-VO:
- Messages of mixed VOs
- Splitting at GLED-mix in autumn
- Option of temporary filtering at the connector level by VO name:
- it is not known whether the VO name is correctly present in the message
- the connector would then have to be changed and configured to handle two DBs
Roadmap - FTS
- Integration ASO/Analysis jobs Dashboard
- Include selection per user in the FTS Dashboard (messages sent by FTS instances need to include the user DN; to follow up with FTS Devs)
- Fix FTS 3 jobs views in the FTS Dashboard (messages sent by FTS instances need to include the instance name; to follow up with FTS Devs)
Interesting Meetings
--
LucaMagnoni - 14 May 2014
Technical Guide for the XRootD Dashboard
A technical guide for the deployment, management and use of the XRootD Monitoring Dashboard can be found in XRootDDashboardGuide.
Oracle Job Monitoring
http://dashb-ai-638.cern.ch/cosmic/OracleJobMonitoring/
Dashboard-Monalisa XRootD Comparator
http://dashb-ai-621.cern.ch/cosmic/Comparator/
EU GLED migration
http://dashb-xrootd-comp.cern.ch/cosmic/CMSmigrationMonitoring/
http://dashb-xrootd-comp.cern.ch/cosmic/ATLASmigrationMonitoring/