ATLAS DDM Dashboard Hand-over Guide

Overview

The ATLAS DDM Dashboard monitors the ATLAS Distributed Data Management (DDM) system. The DDM system is currently being migrated from the "DQ2" implementation to a new implementation, "Rucio" (http://rucio.cern.ch/). For this reason there are currently two dashboards for monitoring DDM: "DDM (DQ2) Dashboard" and "DDM (Rucio) Dashboard". It is anticipated that DQ2 will be decommissioned before the end of 2014. At that point the corresponding dashboard can be decommissioned and we will support only DDM (Rucio) Dashboard.

Feature Comparison

ddm_dashb_rucio_matrix2.png

DDM (Rucio) Dashboard

This dashboard has a single interface:

Architecture

In general this dashboard follows the same architecture as the other WLCG transfer monitoring dashboards; see WLCGDataTransferMonitoring.

rucio_archi.png

Deployment

Production

Rucio (Managed by ATLAS)

AGIS (Managed by ATLAS)

ActiveMQ (Managed by MIG)
  • Overview: Messaging brokers for ATLAS.
  • Broker alias: atlasddm-mb (gridmsg103, gridmsg104)
  • Support: mig@cernNOSPAMPLEASE.ch

Stompctl / dirq
  • Overview: MIG components configured to consume from ActiveMQ to a directory queue.
  • Machines: dashb-ai-641, dashb-ai-642
  • Puppet hostgroup: dashboard/web_server/rucio/production
  • RPM: dashboard-service-collector-ddm + dependencies
  • Queue: /queue/Consumer.dashboard.rucio.events (virtual queue for /topic/rucio.events)
  • Dirq: /opt/dashboard/var/messages/rucio.events_gridmsg103, /opt/dashboard/var/messages/rucio.events_gridmsg104
  • Notes: The configuration is statically bound to gridmsg103 and gridmsg104, so if the ActiveMQ cluster changes, the configuration needs changing (in the RPM).
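
One way to spot a stalled consumer is to watch how many messages are waiting in the two directory queues listed above. The following is a minimal sketch and not part of the dashboard RPMs: the helper name count_pending is hypothetical, and it assumes that every non-directory entry below a queue directory is one pending message. The real dirq on-disk layout may include temporary or lock files that should be excluded.

```python
import os

# Directory queues taken from the "Dirq" entry above.
DIRQ_PATHS = [
    "/opt/dashboard/var/messages/rucio.events_gridmsg103",
    "/opt/dashboard/var/messages/rucio.events_gridmsg104",
]


def count_pending(path):
    """Count non-directory entries below path (hypothetical helper).

    Assumption: each such entry is one queued message.
    Returns 0 if the directory does not exist.
    """
    total = 0
    for _root, _dirs, files in os.walk(path):
        total += len(files)
    return total


if __name__ == "__main__":
    for queue_path in DIRQ_PATHS:
        print(queue_path, count_pending(queue_path))
```

A count that keeps growing between runs would suggest that ddm.collector.dirq2db is not draining the queue.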

DDM (DQ2) DB (Dashboard database)
  • Overview: This is the database for DDM (DQ2) Dashboard. DQ2 statistics are copied from it to DDM (Rucio) Dashboard every 10 minutes.
  • Reader account: atlas_dashboard_dm_reader @ atlas_dashboard_dm (i.e. LCGR)
  • Tables:
    • Only t_stats_file is accessed: File statistics / error samples (10 minute bins).
  • Support: PhyDB.Support@cernNOSPAMPLEASE.ch
  • Resources:

Dashboard Database
  • Overview: Used by UI server and agents.
  • Admin: atlas_dashboard_ddm @ atlas_dashboard_dm (i.e. LCGR) NOTE: subtle name difference with DDM (DQ2) Dashboard database!
  • Reader: atlas_dashboard_ddm_reader @ atlas_dashboard_dm
  • Writer: atlas_dashboard_ddm_writer @ atlas_dashboard_dm
  • Tables:
    • Meta tables:
      • t_schema_version: Schema version. (Useful for upgrades but not used by the application)
      • t_agent: Agent update times.
      • t_site: Site topology.
    • Raw tables:
      • t_raw_file
    • Statistics tables:
      • t_stats_file: File statistics / error samples (10 minute bins)
      • t_stats_file_a: File statistics / error samples (24 hour bins)
      • t_stats_transfer_rate: Average and std dev file statistics (10 minute bins). (Legacy. Could be dropped? See Upcoming section for more details.)
  • Support: PhyDB.Support@cernNOSPAMPLEASE.ch
  • Resources:

UI servers
  • Overview: Serves JSON for UI/API.
  • Machines: dashb-ai-641, dashb-ai-642
  • Alias: DASHB-ATLAS-DDM
  • Puppet hostgroup: dashboard/web_server/rucio/production
  • RPM: dashboard-web-ddm + dependencies
  • Database: atlas_dashboard_ddm_reader @ atlas_dashboard_dm (i.e. LCGR)

Agents
  • Machines: dashb-ai-641, dashb-ai-642 (ddm.collector.dirq2db only)
  • Agents:
    • ddm.collector.dirq2db
      • Continuously collects file events from the directory queue (dirq) into t_raw_file.
      • See stompctl / dirq section for more details.
    • ddm.statistics.file
      • Executes PL/SQL procedure "dashboard_ddm.compLatestStatsFile" every 10 minutes to calculate file statistics in 10 minute bins.
      • Executes PL/SQL procedure "dashboard_ddm.aggrLatestStatsFile" every 10 minutes to aggregate file statistics to 24 hour bins.
      • Statistics stored in DB tables t_stats_file and t_stats_file_a.
      • Statistics available in API and UI.
    • ddm.statistics.dq2
      • Reads dq2 statistics from DDM (DQ2) Dashboard database and merges into file statistics in 10 minute bins.
      • Statistics stored in DB table t_stats_file (and aggregated into t_stats_file_a by ddm.statistics.file).
      • Statistics available in API and UI.
      • NOTE: This is not the same as "statistics.dq2" in DDM (DQ2) Dashboard.
    • ddm.collector.agis
      • Reads site topology from AGIS and merges into site table every 1 hour.
      • Sites stored in DB table t_site.
    • ddm.database.admin
      • Executes PL/SQL procedure "dashboard_admin.dropOldestPartition" every 12 hours to drop oldest partition (if older than 90 days) from t_raw_file.
    • ddm.statistics.transfer.rate (Legacy. Could be dropped? See Upcoming section for more details.)
      • Executes PL/SQL procedure "dashboard_ddm.compLatestStatsTransferRate" every 10 minutes to calculate average and std dev transfer statistics.
      • Statistics stored in DB table t_stats_transfer_rate.
      • Statistics available in API but not in any UI.
  • Puppet hostgroup: dashboard/web_server/rucio/production
  • RPM: dashboard-service-monitor-ddm + dependencies
  • Database:
    • atlas_dashboard_ddm_writer @ atlas_dashboard_dm (i.e. LCGR) for all agents.
    • atlas_dashboard_dm_reader @ atlas_dashboard_dm (i.e. LCGR) for ddm.statistics.dq2 agent.
  • Resources:
    • You can check the logs such as /opt/dashboard/var/log/dashb-ddm.statistics.file.log to see the agents are running every 10 minutes as expected.
    • You can check DB table t_agent to see that the agents are saving a heartbeat.
    • You can check the database session manager to see that no procedure is stuck.
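
The 10 minute and 24 hour bins mentioned above can be pictured as truncating each event timestamp to the start of its bin. This is only an illustrative sketch in Python: the real computation is done by the PL/SQL procedures named above, whose exact bin boundaries and time zone handling are not described in this guide. The function name and the alignment of bins to midnight are assumptions.

```python
from datetime import datetime, timedelta


def bin_start(timestamp, bin_minutes):
    """Return the start of the bin of length bin_minutes containing timestamp.

    Assumption: bins are aligned to midnight of the same day.
    """
    midnight = timestamp.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = timestamp - midnight
    bin_length = timedelta(minutes=bin_minutes)
    # Integer division of two timedeltas gives the number of whole bins elapsed.
    return midnight + (elapsed // bin_length) * bin_length


event_time = datetime(2014, 9, 10, 18, 57, 31)
print(bin_start(event_time, 10))       # 2014-09-10 18:50:00 (10 minute bin)
print(bin_start(event_time, 24 * 60))  # 2014-09-10 00:00:00 (24 hour bin)
```

Under this picture, every 24 hour bin in t_stats_file_a summarises the 144 ten-minute bins of t_stats_file that fall on the same day.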

Integration

Rucio (Managed by ATLAS)
  • As production.

AGIS (Managed by ATLAS)
  • As production.

ActiveMQ (Managed by MIG)
  • As production.

Stompctl / dirq
  • As production except the following.
  • Machines: dashb-ai-611
  • Puppet hostgroup: dashboard/web_server/rucio/integration
  • Queue: /topic/rucio.events
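
The "as production except the following" convention used in this section amounts to overlaying a few integration values on the production settings. The sketch below illustrates this for stompctl / dirq using only values listed in this guide. The dictionary layout and the function name are hypothetical and do not reflect how Puppet or the RPMs actually store the configuration.

```python
# Production values from the Stompctl / dirq entry in the production section.
PRODUCTION_STOMPCTL = {
    "machines": ["dashb-ai-641", "dashb-ai-642"],
    "puppet_hostgroup": "dashboard/web_server/rucio/production",
    "rpm": "dashboard-service-collector-ddm",
    "queue": "/queue/Consumer.dashboard.rucio.events",
}

# Only the values that differ on integration.
INTEGRATION_OVERRIDES = {
    "machines": ["dashb-ai-611"],
    "puppet_hostgroup": "dashboard/web_server/rucio/integration",
    "queue": "/topic/rucio.events",
}


def effective_settings(production, overrides):
    """Return a copy of production with the overrides applied on top."""
    merged = dict(production)
    merged.update(overrides)
    return merged


integration = effective_settings(PRODUCTION_STOMPCTL, INTEGRATION_OVERRIDES)
for key in sorted(integration):
    print(key, "=", integration[key])
```

Anything not overridden, such as the RPM, is inherited unchanged from production.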

DDM (DQ2) DB (Dashboard database)
  • Not used since ddm.statistics.dq2 agent is not running on integration.

Dashboard Database
  • As production except the following.
  • Admin: atlas_dashboard_dm @ int11r
  • Reader: atlas_dashboard_dm_reader @ int11r
  • Writer: atlas_dashboard_dm_writer @ int11r

UI servers
  • As production except the following.
  • Machines: dashb-ai-611
  • Alias: DASHB-ATLAS-DATA-DEV
  • Puppet hostgroup: dashboard/web_server/rucio/integration
  • Database: atlas_dashboard_dm_reader @ int11r

Agents
  • Machines: dashb-ai-611
  • Agents:
    • ddm.collector.dirq2db
    • ddm.statistics.file
    • ddm.statistics.dq2
      • Not running.
    • ddm.collector.agis
    • ddm.database.admin
    • ddm.statistics.transfer.rate (Legacy. Could be dropped? See Upcoming section for more details.)
  • Puppet hostgroup: dashboard/web_server/rucio/integration
  • Database:
    • atlas_dashboard_dm_writer @ int11r for all agents.

Operations

  • There are no regular operational tasks. However, this dashboard is very similar to the other WLCG transfer dashboards, so it is likely to suffer from the same issues (e.g. stacked message queues).
  • Email alerts are configured on the main dashboard log.
  • If the production plots are empty:
    • Check that the collector log, e.g. /opt/dashboard/var/log/dashb-ddm.collector.dirq2db.log, shows continuous entries for consuming raw events.
      • Warnings for unknown event types are OK.
    • Check that the statistics log, e.g. /opt/dashboard/var/log/dashb-ddm.statistics.file.log, shows regular entries (every 10 minutes) for computing statistics.
    • Check that the broker admin interfaces show that the queues are being consumed.
  • JIRA: https://its.cern.ch/jira/browse/WLCGMON/component/13710
  • See support references in deployment section above.
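
The log checks above can be partly automated by looking for gaps between consecutive log timestamps. The sketch below assumes that each log line starts with a timestamp of the form YYYY-MM-DD HH:MM:SS. The actual dashboard log format is not specified in this guide, so the pattern, the function name and the 15 minute threshold are assumptions to adapt, and the sample lines are invented for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp format at the start of each log line.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def find_gaps(lines, max_gap=timedelta(minutes=15)):
    """Return (previous, current) timestamp pairs further apart than max_gap."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if not match:
            continue  # skip lines without a leading timestamp, e.g. tracebacks
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps


# Invented sample lines; real log content will differ.
sample = [
    "2014-09-10 10:00:02 INFO computing statistics",
    "2014-09-10 10:10:01 INFO computing statistics",
    "2014-09-10 10:50:03 INFO computing statistics",
]
for start, end in find_gaps(sample):
    print("no entries between", start, "and", end)
```

To apply it to a real log, pass an open file object as lines, since iterating over a file yields one line at a time.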

Upcoming

DDM (DQ2) Dashboard

This dashboard has two interfaces but it is a single application (i.e. same RPMs, same database).

Architecture

dq2_archi.png

Deployment

Production

Site Services (Managed by ATLAS)

ADCR Database (Managed by ATLAS)
  • Overview: Used by dq2 stats agent to collect statistics on ad-hoc dq2 transfers (i.e. dq2-get / put).
  • Reader account: atlas_dashb_dq2_r @ ADCR_ADG (i.e. Active Data Guard for ADCR)
  • Support: Gancho.Dimitrov@cernNOSPAMPLEASE.ch

AGIS (Managed by ATLAS)

HTTP consumers
  • Overview: HTTP API for receiving notifications of transfer events from DDM site services.
  • Machines: dashb-ai-534, dashb-ai-535
  • Alias: DASHB-ATLAS-DATA-CONSUMER (also for historical reasons: DASHB-ATLAS-DATA-PRODUCER, DASHB-ATLAS-DATA-CONSUMER, DASHB-ATLAS-DATA-CONSUMER-EXT, DASHB-ATLAS-DATA-PROD-EXT, DASHB-ATLAS-DATA-PROD-TEST)
  • Puppet hostgroup: dashboard/ddm_consumer
  • RPM: dashboard-web-data-consumer + dependencies
  • Database: atlas_dashboard_dm_writer @ atlas_dashboard_dm (i.e. LCGR)

Dashboard Database
  • Overview: Used by consumer, UI server and agents.
  • Admin: atlas_dashboard_dm @ atlas_dashboard_dm (i.e. LCGR)
  • Reader: atlas_dashboard_dm_reader @ atlas_dashboard_dm
  • Writer: atlas_dashboard_dm_writer @ atlas_dashboard_dm
  • Tables:
    • Meta tables:
      • t_schema_version: Schema version. (Useful for upgrades but not used by the application)
      • t_agent: Agent update times.
      • t_site: Site topology.
    • Raw tables:
      • t_dataset, t_dataset_location, t_file, t_file_location, t_dataset_file (Legacy used in 1.x API/UI)
      • t_file_placement (Used in 1.x and 2.x UI/API)
      • Note: Raw data is split across many tables which makes database cleaning very hard. This is fixed in DDM (Rucio) Dashboard.
    • Statistics tables:
      • t_service_status (Unused)
      • t_service_metrics (Unused)
      • t_sched_downtime (Unused)
      • t_service_error (Unused)
      • t_stats_data_site_single: Dataset registration statistics (10 minute bins). (Legacy used in 1.x API/UI)
      • t_stats_a_data_site_single: Dataset registration statistics (24 hour bins). (Legacy used in 1.x API/UI)
      • t_stats_error_site_single: Dataset registration error samples (10 minute bins). (Legacy used in 1.x API/UI)
      • t_stats_a_error_site_single: Dataset registration error samples (24 hour bins). (Legacy used in 1.x API/UI)
      • t_stats_data_transfer_rate: Average and std dev file statistics (10 minute bins). (Used in 2.x API for Site Services)
      • t_stats_file: File statistics / error samples (10 minute bins). (Used in 2.x API/UI)
      • t_stats_file_a: File statistics / error samples (24 hour bins). (Used in 2.x API/UI)
  • Support: PhyDB.Support@cernNOSPAMPLEASE.ch
  • Resources:

UI servers
  • Overview: Serves HTML and PNG for 1.x and JSON for 2.x.
  • Machines: dashb-ai-532, dashb-ai-533
  • Alias: DASHB-ATLAS-DATA (also for historical reasons: DASHB-ATLAS-DATA-TEST)
  • Puppet hostgroup: dashboard/web_server/ddm/production
  • RPM: dashboard-web-data + dependencies
  • Database: atlas_dashboard_dm_reader @ atlas_dashboard_dm (i.e. LCGR)

Agents
  • Machines: dashb-ai-532
  • Agents:
    • data.stats.collection (Legacy for 1.x)
      • Executes PL/SQL procedure "dashboard.computeSiteSingleDataStats" every 5 minutes to calculate dataset registration statistics in 10 minute bins.
      • Note: This procedure is not efficient and is the most likely to run slowly. However, it is not used for 2.x so it is not very significant.
      • Executes PL/SQL procedure "dashboard.computeSiteSingleErrorSums" every 10 minutes to calculate dataset registration error samples in 10 minute bins.
      • Executes PL/SQL procedures "dashboard.aggregateDataStatistics" and "dashboard.aggregateErrorSummaries" every 10 minutes to aggregate dataset registration statistics / error samples to 24 hour bins.
      • Statistics / error samples stored in DB tables t_stats_data_site_single, t_stats_a_data_site_single, t_stats_error_site_single, and t_stats_a_error_site_single.
      • Statistics / error samples available via 1.x API and in 1.x UI.
    • transfer.rate.stats
      • Executes PL/SQL procedure "dashboard.computeDataTransferRateStats" every 10 minutes to calculate average and std dev transfer statistics.
      • Statistics stored in DB table t_stats_data_transfer_rate.
      • Statistics available in 2.x API but not in any UI.
      • API used by Site Services for optimisation and by http://bourricot.cern.ch/dq2/ftsmon/ (which will be decommissioned in the future, see ADC Monitoring for further details).
    • statistics.file
      • Executes PL/SQL procedure "dashboard_new.compLatestStatsFile" every 10 minutes to calculate file statistics in 10 minute bins.
      • Executes PL/SQL procedure "dashboard_new.aggrLatestStatsFile" every 10 minutes to aggregate file statistics to 24 hour bins.
      • Statistics stored in DB tables t_stats_file and t_stats_file_a.
      • Statistics available in 2.x API and 2.x UI.
    • statistics.dq2
      • Reads statistics on dq2-get/put activity from ATLAS DDM database and merges into file statistics in 10 minute bins.
      • Statistics stored in DB table t_stats_file (and aggregated into t_stats_file_a by statistics.file).
      • Statistics available in 2.x API and 2.x UI.
    • ddm.collector.agis
      • Reads site topology from AGIS and merges into site table every 1 hour.
      • Sites stored in DB table t_site.
  • Puppet hostgroup: dashboard/web_server/ddm/production
  • RPM: dashboard-service-monitor-site + dependencies
  • Database:
    • atlas_dashboard_dm_writer @ atlas_dashboard_dm (i.e. LCGR) for all agents.
    • atlas_dashb_dq2_r @ ADCR_ADG (i.e. Active Data Guard for ADCR) for dq2 stats agent.
  • Resources:
    • You can check the logs such as /opt/dashboard/var/log/dashb-statistics.file.log to see the agents are running every 10 minutes as expected.
    • You can check DB table t_agent to see that the agents are saving a heartbeat.
    • You can check the database session manager to see that no procedure is stuck.
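
The t_agent heartbeat check can be expressed as a comparison between each agent's last update time and its expected period. The sketch below works on rows already fetched from the database. The row shape (agent name, last update time), the function name and the tolerance factor are assumptions, since the t_agent columns are not described in this guide. The periods are taken from the agent descriptions above, and the sample rows are invented.

```python
from datetime import datetime, timedelta

# Expected periods taken from the agent descriptions above.
EXPECTED_PERIOD = {
    "transfer.rate.stats": timedelta(minutes=10),
    "statistics.file": timedelta(minutes=10),
    "ddm.collector.agis": timedelta(hours=1),
}


def stale_agents(rows, now, tolerance=3):
    """Return names of agents whose heartbeat is older than tolerance periods.

    rows is an iterable of (agent_name, last_update) pairs.
    Agents without a known period are ignored.
    """
    stale = []
    for name, last_update in rows:
        period = EXPECTED_PERIOD.get(name)
        if period is not None and now - last_update > tolerance * period:
            stale.append(name)
    return stale


# Invented rows for illustration only.
now = datetime(2014, 9, 10, 12, 0)
rows = [
    ("statistics.file", datetime(2014, 9, 10, 11, 55)),
    ("transfer.rate.stats", datetime(2014, 9, 10, 10, 0)),
    ("ddm.collector.agis", datetime(2014, 9, 10, 11, 30)),
]
print(stale_agents(rows, now))  # ['transfer.rate.stats']
```

A tolerance of several periods avoids flagging an agent that is merely running one slow iteration.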

Integration

AGIS (Managed by ATLAS)
  • As production.

Site Services (Managed by ATLAS)
  • As production except we receive very few transfer events.

ADCR Database (Managed by ATLAS)
  • Not used since statistics.dq2 agent is not running on integration.

HTTP consumers
  • As production except the following.
  • Machines: dashb-ai-531
  • Alias: DASHB-ATLAS-DATA-SOUP-TBED-CONSUMER
  • Puppet hostgroup: dashboard/web_server/ddm/integration
  • Database: atlas_dashboard_dm_writer @ int6r

Dashboard Database
  • As production except the following.
  • Admin: atlas_dashboard_dm @ int6r
  • Reader: atlas_dashboard_dm_reader @ int6r
  • Writer: atlas_dashboard_dm_writer @ int6r

UI servers
  • As production except the following.
  • Machines: dashb-ai-531
  • Alias: DASHB-ATLAS-DATA-SOUP-TBED
  • Puppet hostgroup: dashboard/web_server/ddm/integration
  • Database: atlas_dashboard_dm_reader @ int6r

Agents
  • As production except the following.
  • Machines: dashb-ai-531
  • Agents:
    • data.stats.collection (Legacy for 1.x)
    • transfer.rate.stats
    • statistics.file
    • statistics.dq2
      • Not running.
    • ddm.collector.agis
  • Puppet hostgroup: dashboard/web_server/ddm/integration
  • RPM: dashboard-service-monitor-site + dependencies
  • Database:
    • atlas_dashboard_dm_writer @ int6r for all agents.

Operations

  • There are no regular operational tasks.
  • Email alerts are configured on the main dashboard log.
  • If the production plots are empty:
    • Check that the logs in /opt/dashboard/var/log on the HTTP consumers are accessible and show regular entries and no errors.
      • A failure here would be due to the VM being down (inaccessible) or to database issues (slow insertion).
    • Check that agent logs, e.g. /opt/dashboard/var/log/dashb-statistics.file.log, show regular entries (every 10 minutes).
      • Irregular entries typically mean the database has issues. Usually another application is overloading it.
  • JIRA: https://its.cern.ch/jira/browse/WLCGMON/component/13710
  • See support references in deployment section above.

Upcoming

  • DDM (DQ2) Dashboard should be decommissioned after DQ2 has been decommissioned. Old statistics have already been copied to DDM (Rucio) Dashboard and current statistics are being copied by the agent ddm.statistics.dq2 on DDM (Rucio) Dashboard. So decommissioning consists of the following:
    • Consult with ADC Monitoring to co-ordinate the decommissioning.
    • Verify all file statistics have been copied to DDM (Rucio) Dashboard. This should already be the case.
    • Stop the ddm.statistics.dq2 agent on DDM (Rucio) Dashboard.
    • Delete DDM (DQ2) Dashboard servers and database.
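
The verification step could be approached by comparing the set of 10 minute bins present in t_stats_file on each side. The sketch below operates on bin start times already read from the two databases. How the bins are queried and which column holds the bin time are not covered in this guide, the function name is hypothetical, and the sample values are invented.

```python
from datetime import datetime


def missing_bins(dq2_bins, rucio_bins):
    """Return, sorted, the bins present on the DQ2 side but not on the Rucio side."""
    return sorted(set(dq2_bins) - set(rucio_bins))


# Invented bin start times for illustration only.
dq2 = [
    datetime(2014, 9, 10, 12, 0),
    datetime(2014, 9, 10, 12, 10),
    datetime(2014, 9, 10, 12, 20),
]
rucio = [
    datetime(2014, 9, 10, 12, 0),
    datetime(2014, 9, 10, 12, 20),
]
print(missing_bins(dq2, rucio))
```

An empty result only shows that every bin exists on both sides. It does not confirm that the values inside each bin match, which would need a further comparison.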

-- DavidTuckett - 10 Sep 2014

Topic revision: r3 - 2014-09-15 - DavidTuckett
 