ATLAS DDM Dashboard Hand-over Guide
Overview
The ATLAS DDM Dashboard monitors the ATLAS Distributed Data Management (DDM) system. The DDM system is currently being migrated from one implementation, "DQ2", to a new implementation, "Rucio" (http://rucio.cern.ch/). For this reason there are currently two dashboards for monitoring DDM: "DDM (DQ2) Dashboard" and "DDM (Rucio) Dashboard". It is anticipated that DQ2 will be decommissioned before the end of 2014. At that point the corresponding dashboard can be decommissioned and we will support only DDM (Rucio) Dashboard.
Feature Comparison
DDM (Rucio) Dashboard
This dashboard has a single interface:
Architecture
In general this dashboard follows the same architecture as the other WLCG transfer monitoring dashboards; see WLCGDataTransferMonitoring.
Deployment
Production
Rucio (Managed by ATLAS)
AGIS (Managed by ATLAS)
ActiveMQ (Managed by MIG)
- Overview: Messaging brokers for ATLAS.
- Broker alias: atlasddm-mb (gridmsg103, gridmsg104)
- Support: mig@cernNOSPAMPLEASE.ch
Stompctl / dirq
- Overview: MIG components configured to consume from ActiveMQ to a directory queue.
- Machines: dashb-ai-641, dashb-ai-642
- Puppet hostgroup: dashboard/web_server/rucio/production
- RPM: dashboard-service-collector-ddm + dependencies
- Queue: /queue/Consumer.dashboard.rucio.events (virtual queue for /topic/rucio.events)
- Dirq: /opt/dashboard/var/messages/rucio.events_gridmsg103, /opt/dashboard/var/messages/rucio.events_gridmsg104
- Notes: The configuration is statically bound to gridmsg103 and gridmsg104, so if the ActiveMQ cluster changes the configuration needs changing (in the RPM).
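The naming pattern used by the queue and directory queues above can be sketched in Python. This is only an illustration, assuming that the consumer queue is the topic name prefixed with "Consumer.dashboard." and that each directory queue is named after the topic and the broker host, as the values listed above suggest. The helper names consumer_queue and dirq_paths are hypothetical and are not part of the dashboard RPMs.

```python
import posixpath

TOPIC_PREFIX = "/topic/"


def topic_name(topic):
    """Strip the /topic/ prefix from a destination such as /topic/rucio.events."""
    if not topic.startswith(TOPIC_PREFIX):
        raise ValueError("expected a destination starting with /topic/")
    return topic[len(TOPIC_PREFIX):]


def consumer_queue(topic, client="dashboard"):
    """Return the consumer queue name for a topic (assumed naming pattern)."""
    return "/queue/Consumer.%s.%s" % (client, topic_name(topic))


def dirq_paths(topic, brokers, base="/opt/dashboard/var/messages"):
    """Return one directory queue path per broker host (assumed naming pattern)."""
    name = topic_name(topic)
    return [posixpath.join(base, "%s_%s" % (name, broker)) for broker in brokers]


print(consumer_queue("/topic/rucio.events"))
for path in dirq_paths("/topic/rucio.events", ["gridmsg103", "gridmsg104"]):
    print(path)
```

If the brokers behind the atlasddm-mb alias change, the list passed to dirq_paths shows which directory queues a matching configuration would need.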
DDM (DQ2) DB (Dashboard database)
- Overview: This is the database for DDM (DQ2) Dashboard. DQ2 statistics are copied from it to DDM (Rucio) Dashboard every 10 minutes.
- Reader account: atlas_dashboard_dm_reader @ atlas_dashboard_dm (i.e. LCGR)
- Tables:
- Only t_stats_file is accessed: File statistics / error samples (10 minute bins).
- Support: PhyDB.Support@cernNOSPAMPLEASE.ch
- Resources:
Dashboard Database
- Overview: Used by UI server and agents.
- Admin: atlas_dashboard_ddm @ atlas_dashboard_dm (i.e. LCGR) NOTE: subtle name difference with DDM (DQ2) Dashboard database!
- Reader: atlas_dashboard_ddm_reader @ atlas_dashboard_dm
- Writer: atlas_dashboard_ddm_writer @ atlas_dashboard_dm
- Tables:
- Meta tables:
- t_schema_version: Schema version. (Useful for upgrades but not used by the application)
- t_agent: Agent update times.
- t_site: Site topology.
- Raw tables:
- Statistics tables:
- t_stats_file: File statistics / error samples (10 minute bins)
- t_stats_file_a: File statistics / error samples (24 hour bins)
- t_stats_transfer_rate: Average and std dev file statistics (10 minute bins). (Legacy. Could be dropped? See Upcoming section for more details.)
- Support: PhyDB.Support@cernNOSPAMPLEASE.ch
- Resources:
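The 10 minute and 24 hour bins used by the statistics tables can be illustrated with a small Python sketch. It assumes that bins are aligned to midnight and that timestamps are naive datetimes; neither the alignment nor the time zone is stated in this guide, and the function bin_start is hypothetical (the real binning is done by the PL/SQL procedures).

```python
from datetime import datetime, timedelta


def bin_start(ts, minutes):
    """Floor a timestamp to the start of its bin of the given width in minutes."""
    day_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - day_start).total_seconds() // 60)
    return day_start + timedelta(minutes=(elapsed // minutes) * minutes)


event_time = datetime(2014, 9, 10, 14, 37, 12)
print(bin_start(event_time, 10))       # bin as used in t_stats_file
print(bin_start(event_time, 24 * 60))  # bin as used in t_stats_file_a
```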
UI servers
- Overview: Serves JSON for UI/API.
- Machines: dashb-ai-641, dashb-ai-642
- Alias: DASHB-ATLAS-DDM
- Puppet hostgroup: dashboard/web_server/rucio/production
- RPM: dashboard-web-ddm + dependencies
- Database: atlas_dashboard_ddm_reader @ atlas_dashboard_dm (i.e. LCGR)
Agents
- Machines: dashb-ai-641, dashb-ai-642 (ddm.collector.dirq2db only)
- Agents:
- ddm.collector.dirq2db
- Executes code to collect file events from dirq to t_raw_file continuously.
- See stompctl / dirq section for more details.
- ddm.statistics.file
- Executes PL/SQL procedure "dashboard_ddm.compLatestStatsFile" every 10 minutes to calculate file statistics in 10 minute bins.
- Executes PL/SQL procedure "dashboard_ddm.aggrLatestStatsFile" every 10 minutes to aggregate file statistics to 24 hour bins.
- Statistics stored in DB table t_stats_file and t_stats_file_a.
- Statistics available in API and UI.
- ddm.statistics.dq2
- Reads dq2 statistics from DDM (DQ2) Dashboard database and merges into file statistics in 10 minute bins.
- Statistics stored in DB table t_stats_file (and aggregated into t_stats_file_a by ddm.statistics.file).
- Statistics available in API and UI.
- NOTE: This is not the same as "statistics.dq2" in DDM (DQ2) Dashboard.
- ddm.collector.agis
- Reads site topology from AGIS and merges into site table every 1 hour.
- Sites stored in DB table t_site.
- ddm.database.admin
- Executes PL/SQL procedure "dashboard_admin.dropOldestPartition" every 12 hours to drop oldest partition (if older than 90 days) from t_raw_file.
- ddm.statistics.transfer.rate (Legacy. Could be dropped? See Upcoming section for more details.)
- Executes PL/SQL procedure "dashboard_ddm.compLatestStatsTransferRate" every 10 minutes to calculate average and std dev transfer statistics.
- Statistics stored in DB table t_stats_transfer_rate.
- Statistics available in API but not in any UI.
- Puppet hostgroup: dashboard/web_server/rucio/production
- RPM: dashboard-service-monitor-ddm + dependencies
- Database:
- atlas_dashboard_ddm_writer @ atlas_dashboard_dm (i.e. LCGR) for all agents.
- atlas_dashboard_dm_reader @ atlas_dashboard_dm (i.e. LCGR) for ddm.statistics.dq2 agent.
- Resources:
- You can check the logs such as /opt/dashboard/var/log/dashb-ddm.statistics.file.log to see the agents are running every 10 minutes as expected.
- You can check DB table t_agent to see that the agents are saving a heartbeat.
- You can check the database session manager to see that no procedure is stuck.
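The retention rule applied by the ddm.database.admin agent (drop the oldest partition of t_raw_file if it is older than 90 days) can be sketched as follows. This is not the PL/SQL procedure itself: representing each partition by a single date is an assumption made for illustration, and partition_to_drop is a hypothetical name.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # value taken from the agent description above


def partition_to_drop(partition_dates, today, retention_days=RETENTION_DAYS):
    """Return the oldest partition date if it is past retention, else None."""
    if not partition_dates:
        return None
    oldest = min(partition_dates)
    if today - oldest > timedelta(days=retention_days):
        return oldest
    return None


partitions = [date(2014, 6, 1), date(2014, 7, 1), date(2014, 9, 1)]
print(partition_to_drop(partitions, today=date(2014, 9, 10)))
```

Only one partition is considered per call, which mirrors an agent that runs every 12 hours and removes at most the oldest partition each time.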
Integration
Rucio (Managed by ATLAS)
AGIS (Managed by ATLAS)
ActiveMQ (Managed by MIG)
Stompctl / dirq
- As production except the following.
- Machines: dashb-ai-611
- Puppet hostgroup: dashboard/web_server/rucio/integration
- Queue: /topic/rucio.events
DDM (DQ2) DB (Dashboard database)
- Not used since ddm.statistics.dq2 agent is not running on integration.
Dashboard Database
- As production except the following.
- Admin: atlas_dashboard_dm @ int11r
- Reader: atlas_dashboard_dm_reader @ int11r
- Writer: atlas_dashboard_dm_writer @ int11r
UI servers
- As production except the following.
- Machines: dashb-ai-611
- Alias: DASHB-ATLAS-DATA-DEV
- Puppet hostgroup: dashboard/web_server/rucio/integration
- Database: atlas_dashboard_dm_reader @ int11r
Agents
- Machines: dashb-ai-611
- Agents:
- ddm.collector.dirq2db
- ddm.statistics.file
- ddm.statistics.dq2
- ddm.collector.agis
- ddm.database.admin
- ddm.statistics.transfer.rate (Legacy. Could be dropped? See Upcoming section for more details.)
- Puppet hostgroup: dashboard/web_server/rucio/integration
- Database:
- atlas_dashboard_dm_writer @ int11r for all agents.
Operations
- There are no regular operational tasks. However, this dashboard is very similar to the other WLCG transfer dashboards, so it is likely to suffer from the same issues (e.g. stacked message queues).
- Email alerts are configured on the main dashboard log.
- If the production plots are empty:
- Check that the agent logs, e.g. /opt/dashboard/var/log/dashb-ddm.collector.dirq2db.log, show continuous entries for consuming raw events.
- Warnings for unknown event types are OK.
- Check that the agent logs, e.g. /opt/dashboard/var/log/dashb-ddm.statistics.file.log, show regular entries (every 10 minutes) for computing statistics.
- Check that the broker admin interfaces show that the queues are being consumed.
- JIRA: https://its.cern.ch/jira/browse/WLCGMON/component/13710
- See support references in deployment section above.
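The log checks above can be partly automated. The following Python sketch reports whether a log file has been modified recently; it uses only the file modification time, so it cannot tell whether the entries are errors. The 20 minute allowance (two 10 minute cycles) and the function name log_is_fresh are assumptions, not existing dashboard tooling.

```python
import os
import time

LOG_DIR = "/opt/dashboard/var/log"


def log_is_fresh(path, max_age_seconds, now=None):
    """Return True if the file exists and was modified within max_age_seconds."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable log is treated as not fresh.
        return False
    return (now - mtime) <= max_age_seconds


for name in ("dashb-ddm.collector.dirq2db.log", "dashb-ddm.statistics.file.log"):
    path = os.path.join(LOG_DIR, name)
    status = "fresh" if log_is_fresh(path, 20 * 60) else "STALE or missing"
    print("%s: %s" % (path, status))
```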
Upcoming
DDM (DQ2) Dashboard
This dashboard has two interfaces but it is a single application (i.e. same RPMs, same database).
Architecture
Deployment
Production
Site Services (Managed by ATLAS)
- Overview: Installed on ATLAS VO boxes at CERN. Part of the DDM DQ2 system. Sends notifications of transfer events to HTTP consumers. If transfer events cannot be consumed because the HTTP consumers are slow or down, the events are stored in a local database and retried. If the backlog of events gets too large you will be notified by ATLAS DQ2 support.
- Support: atlas-dq2-support@cernNOSPAMPLEASE.ch or Tomas.Kouba@cernNOSPAMPLEASE.ch
- Resources:
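The store-and-retry behaviour described in the overview above can be illustrated with a short Python sketch. This is not the Site Services code: the class NotifyingSender is hypothetical, an in-memory deque stands in for the local database, and the send callable stands in for the HTTP call to the consumers.

```python
from collections import deque


class NotifyingSender(object):
    """Deliver events to a consumer, keeping undelivered events for retry."""

    def __init__(self, send):
        self._send = send        # callable(event) -> True if the event was consumed
        self.backlog = deque()   # stands in for the local database

    def notify(self, event):
        """Try to deliver an event; store it for retry if delivery fails."""
        if not self._send(event):
            self.backlog.append(event)

    def retry(self):
        """Retry stored events in order, stopping at the first failure.

        Returns the number of events delivered.
        """
        delivered = 0
        while self.backlog:
            if not self._send(self.backlog[0]):
                break
            self.backlog.popleft()
            delivered += 1
        return delivered
```

With this shape, a slow or unavailable consumer shows up as a growing backlog, which is the condition ATLAS DQ2 support would notify you about.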
ADCR Database (Managed by ATLAS)
- Overview: Used by dq2 stats agent to collect statistics on ad-hoc dq2 transfers (i.e. dq2-get / put).
- Reader account: atlas_dashb_dq2_r @ ADCR_ADG (i.e. Active Data Guard for ADCR)
- Support: Gancho.Dimitrov@cernNOSPAMPLEASE.ch
AGIS (Managed by ATLAS)
HTTP consumers
- Overview: HTTP API for receiving notifications of transfer events from DDM site services.
- Machines: dashb-ai-534, dashb-ai-535
- Alias: DASHB-ATLAS-DATA-CONSUMER (also for historical reasons: DASHB-ATLAS-DATA-PRODUCER, DASHB-ATLAS-DATA-CONSUMER, DASHB-ATLAS-DATA-CONSUMER-EXT, DASHB-ATLAS-DATA-PROD-EXT, DASHB-ATLAS-DATA-PROD-TEST)
- Puppet hostgroup: dashboard/ddm_consumer
- RPM: dashboard-web-data-consumer + dependencies
- Database: atlas_dashboard_dm_writer @ atlas_dashboard_dm (i.e. LCGR)
Dashboard Database
- Overview: Used by consumer, UI server and agents.
- Admin: atlas_dashboard_dm @ atlas_dashboard_dm (i.e. LCGR)
- Reader: atlas_dashboard_dm_reader @ atlas_dashboard_dm
- Writer: atlas_dashboard_dm_writer @ atlas_dashboard_dm
- Tables:
- Meta tables:
- t_schema_version: Schema version. (Useful for upgrades but not used by the application)
- t_agent: Agent update times.
- t_site: Site topology.
- Raw tables:
- t_dataset, t_dataset_location, t_file, t_file_location, t_dataset_file (Legacy used in 1.x API/UI)
- t_file_placement (Used in 1.x and 2.x UI/API)
- Note: Raw data is split across many tables which makes database cleaning very hard. This is fixed in DDM (Rucio) Dashboard.
- Statistics tables:
- t_service_status (Unused)
- t_service_metrics (Unused)
- t_sched_downtime (Unused)
- t_service_error (Unused)
- t_stats_data_site_single: Dataset registration statistics (10 minute bins). (Legacy used in 1.x API/UI)
- t_stats_a_data_site_single: Dataset registration statistics (24 hour bins). (Legacy used in 1.x API/UI)
- t_stats_error_site_single: Dataset registration error samples (10 minute bins). (Legacy used in 1.x API/UI)
- t_stats_a_error_site_single: Dataset registration error samples (24 hour bins). (Legacy used in 1.x API/UI)
- t_stats_data_transfer_rate: Average and std dev file statistics (10 minute bins). (Used in 2.x API for Site Services)
- t_stats_file: File statistics / error samples (10 minute bins). (Used in 2.x API/UI)
- t_stats_file_a: File statistics / error samples (24 hour bins). (Used in 2.x API/UI)
- Support: PhyDB.Support@cernNOSPAMPLEASE.ch
- Resources:
UI servers
- Overview: Serves HTML and PNG for 1.x and JSON for 2.x.
- Machines: dashb-ai-532, dashb-ai-533
- Alias: DASHB-ATLAS-DATA (also for historical reasons: DASHB-ATLAS-DATA-TEST)
- Puppet hostgroup: dashboard/web_server/ddm/production
- RPM: dashboard-web-data + dependencies
- Database: atlas_dashboard_dm_reader @ atlas_dashboard_dm (i.e. LCGR)
Agents
- Machines: dashb-ai-532
- Agents:
- data.stats.collection (Legacy for 1.x)
- Executes PL/SQL procedure "dashboard.computeSiteSingleDataStats" every 5 minutes to calculate dataset registration statistics in 10 minute bins.
- Note: This procedure is not efficient and is the most likely to run slowly. However, it is not used for 2.x so it is not very significant.
- Executes PL/SQL procedure "dashboard.computeSiteSingleErrorSums" every 10 minutes to calculate dataset registration error samples in 10 minute bins.
- Executes PL/SQL procedure "dashboard.aggregateDataStatistics" and "dashboard.aggregateErrorSummaries" every 10 minutes to aggregate dataset registration statistics / error samples to 24 hour bins.
- Statistics / error samples stored in DB tables t_stats_data_site_single, t_stats_a_data_site_single, t_stats_error_site_single, and t_stats_a_error_site_single.
- Statistics / error samples available via 1.x API and in 1.x UI.
- transfer.rate.stats
- Executes PL/SQL procedure "dashboard.computeDataTransferRateStats" every 10 minutes to calculate average and std dev transfer statistics.
- Statistics stored in DB table t_stats_data_transfer_rate.
- Statistics available in 2.x API but not in any UI.
- API used by Site Services for optimisation and by http://bourricot.cern.ch/dq2/ftsmon/ (which will be decommissioned in the future, see ADC Monitoring for further details).
- statistics.file
- Executes PL/SQL procedure "dashboard_new.compLatestStatsFile" every 10 minutes to calculate file statistics in 10 minute bins.
- Executes PL/SQL procedure "dashboard_new.aggrLatestStatsFile" every 10 minutes to aggregate file statistics to 24 hour bins.
- Statistics stored in DB table t_stats_file and t_stats_file_a.
- Statistics available in 2.x API and 2.x UI.
- statistics.dq2
- Reads statistics on dq2-get/put activity from ATLAS DDM database and merges into file statistics in 10 minute bins.
- Statistics stored in DB table t_stats_file (and aggregated into t_stats_file_a by statistics.file).
- Statistics available in 2.x API and 2.x UI.
- ddm.collector.agis
- Reads site topology from AGIS and merges into site table every 1 hour.
- Sites stored in DB table t_site.
- Puppet hostgroup: dashboard/web_server/ddm/production
- RPM: dashboard-service-monitor-site + dependencies
- Database:
- atlas_dashboard_dm_writer @ atlas_dashboard_dm (i.e. LCGR) for all agents.
- atlas_dashb_dq2_r @ ADCR_ADG (i.e. Active Data Guard for ADCR) for dq2 stats agent.
- Resources:
- You can check the logs such as /opt/dashboard/var/log/dashb-statistics.file.log to see the agents are running every 10 minutes as expected.
- You can check DB table t_agent to see that the agents are saving a heartbeat.
- You can check the database session manager to see that no procedure is stuck.
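The heartbeat check on t_agent can be sketched in Python. The intervals below are taken from the agent descriptions above (for data.stats.collection the shortest listed interval is used; statistics.dq2 is omitted because no interval is stated for it). The column layout of t_agent is not documented here, so the input is a plain mapping of agent name to last update time; that mapping, the slack factor and the function name stale_agents are all assumptions.

```python
from datetime import datetime, timedelta

# Expected run intervals, from the agent descriptions above.
EXPECTED_INTERVAL = {
    "data.stats.collection": timedelta(minutes=5),
    "transfer.rate.stats": timedelta(minutes=10),
    "statistics.file": timedelta(minutes=10),
    "ddm.collector.agis": timedelta(hours=1),
}


def stale_agents(last_update, now, slack=2):
    """Return agents with a missing heartbeat or one older than slack x interval."""
    stale = []
    for agent, interval in EXPECTED_INTERVAL.items():
        seen = last_update.get(agent)
        if seen is None or now - seen > slack * interval:
            stale.append(agent)
    return sorted(stale)


now = datetime(2014, 9, 10, 12, 0)
heartbeats = {
    "data.stats.collection": now - timedelta(minutes=3),
    "transfer.rate.stats": now - timedelta(minutes=45),
    "statistics.file": now - timedelta(minutes=15),
}
print(stale_agents(heartbeats, now))
```

An agent reported as stale is a hint to look at its log and at the database session manager for a stuck procedure.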
Integration
AGIS (Managed by ATLAS)
Site Services (Managed by ATLAS)
- As production except we receive very few transfer events.
ADCR Database (Managed by ATLAS)
- Not used since statistics.dq2 agent is not running on integration.
HTTP consumers
- As production except the following.
- Machines: dashb-ai-531
- Alias: DASHB-ATLAS-DATA-SOUP-TBED-CONSUMER
- Puppet hostgroup: dashboard/web_server/ddm/integration
- Database: atlas_dashboard_dm_writer @ int6r
Dashboard Database
- As production except the following.
- Admin: atlas_dashboard_dm @ int6r
- Reader: atlas_dashboard_dm_reader @ int6r
- Writer: atlas_dashboard_dm_writer @ int6r
UI servers
- As production except the following.
- Machines: dashb-ai-531
- Alias: DASHB-ATLAS-DATA-SOUP-TBED
- Puppet hostgroup: dashboard/web_server/ddm/integration
- Database: atlas_dashboard_dm_reader @ int6r
Agents
- As production except the following.
- Machines: dashb-ai-531
- Agents:
- data.stats.collection (Legacy for 1.x)
- transfer.rate.stats
- statistics.file
- statistics.dq2
- ddm.collector.agis
- Puppet hostgroup: dashboard/web_server/ddm/integration
- RPM: dashboard-service-monitor-site + dependencies
- Database:
- atlas_dashboard_dm_writer @ int6r for all agents.
Operations
- There are no regular operational tasks.
- Email alerts are configured on the main dashboard log.
- If the production plots are empty:
- Check that the logs in /opt/dashboard/var/log on the HTTP consumers are accessible and show regular entries and no errors.
- Problems here would be due to a VM being down (inaccessible) or to database issues (slow insertion).
- Check that the agent logs, e.g. /opt/dashboard/var/log/dashb-statistics.file.log, show regular entries (every 10 minutes).
- Missing entries are typically caused by database issues. Usually another application is overloading the database.
- JIRA: https://its.cern.ch/jira/browse/WLCGMON/component/13710
- See support references in deployment section above.
Upcoming
- DDM (DQ2) Dashboard should be decommissioned after DQ2 has been decommissioned. Old statistics have already been copied to DDM (Rucio) Dashboard and current statistics are being copied by the agent ddm.statistics.dq2 on DDM (Rucio) Dashboard. So decommissioning consists of the following:
- Consult with ADC Monitoring to co-ordinate the decommissioning.
- Verify all file statistics have been copied to DDM (Rucio) Dashboard. This should already be the case.
- Stop the ddm.statistics.dq2 agent on DDM (Rucio) Dashboard.
- Delete DDM (DQ2) Dashboard servers and database.
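The verification step in the list above could be supported by a comparison of the 10 minute bins present in each database. The Python sketch below only checks that every bin seen in DDM (DQ2) Dashboard also exists in DDM (Rucio) Dashboard; it does not compare values, since the Rucio side also contains Rucio statistics. How the bin lists are queried is left open, and missing_bins is a hypothetical helper.

```python
from datetime import datetime


def missing_bins(dq2_bins, rucio_bins):
    """Return bin start times present in dq2_bins but absent from rucio_bins."""
    return sorted(set(dq2_bins) - set(rucio_bins))


# Example data only; real bin lists would come from t_stats_file in each database.
dq2 = [
    datetime(2014, 9, 10, 12, 0),
    datetime(2014, 9, 10, 12, 10),
    datetime(2014, 9, 10, 12, 20),
]
rucio = [
    datetime(2014, 9, 10, 12, 0),
    datetime(2014, 9, 10, 12, 20),
]
for bin_time in missing_bins(dq2, rucio):
    print("not yet copied: %s" % bin_time)
```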
-- DavidTuckett - 10 Sep 2014