Support FTS Monitoring Dashboard

This page documents the support procedures for the FTS Monitoring Dashboard.

FTS Infrastructure

All machines are managed by Puppet.

Messaging infrastructure

  • Production broker
    • dashb-mb.cern.ch

  • Queues
    • FTS
      • /queue/Consumer.dashb-fts.transfer.fts_monitoring_complete
      • /queue/Consumer.dashb-fts.transfer.fts_monitoring_start
      • /queue/Consumer.dashb-fts.transfer.fts_monitoring_state
    • ALICE
      • /queue/Consumer.dashb-alice.xrootd.alice.site_file
      • /queue/Consumer.dashb-alice.xrootd.alice.site_traffic
    • ASO
      • /queue/Consumer.dashb-aso-jobmon.fts.aso
      • /queue/Consumer.dashb-aso-ftsmon.fts.aso

  • Topics
    • FTS
      • /topic/transfer.fts_monitoring_state
      • /topic/transfer.fts_monitoring_start
      • /topic/transfer.fts_monitoring_complete
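
The FTS names listed above appear to follow one pattern: each consumer queue is the topic name prefixed with Consumer.<client>. The sketch below illustrates that mapping. The pattern is inferred only from the names on this page, and the helper name consumer_queue is hypothetical.

```python
def consumer_queue(client, topic):
    """Derive the consumer queue destination for a topic destination.

    Assumes the naming pattern visible in the lists above:
    /topic/<name>  ->  /queue/Consumer.<client>.<name>
    """
    prefix = "/topic/"
    if not topic.startswith(prefix):
        raise ValueError("not a topic destination: %r" % topic)
    return "/queue/Consumer.%s.%s" % (client, topic[len(prefix):])


if __name__ == "__main__":
    # The three FTS topics listed on this page.
    for topic in ("/topic/transfer.fts_monitoring_state",
                  "/topic/transfer.fts_monitoring_start",
                  "/topic/transfer.fts_monitoring_complete"):
        print(consumer_queue("dashb-fts", topic))
```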

Stompclt: message consumption

Stompclt consumes messages from a SINGLE broker and stores them on local disk. The stompclt command should not be run manually; it is handled by simplevisor (see the next section).

All configuration files can be found in: /opt/dashboard/etc/dashboard-simplevisor/

Example of a configuration file:

vim /opt/dashboard/etc/dashboard-simplevisor/fts_state_consumer.cfg 
<incoming-broker>
  auth = "x509 cert=/home/dboard/.security/usercert.pem,key=/home/dboard/.security/userkey.pem"
</incoming-broker>

outgoing-queue = "path=/opt/dashboard/var/messages/fts_state"
subscribe = "destination=/queue/Consumer.dashb-fts.transfer.fts_monitoring_state"
reliable = true
heart-beat=true
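
Since the consumer configuration files differ only in the subscribed queue and the local spool directory, one file can be rendered from a template. The following is a minimal sketch that reproduces the example above; the function name render_consumer_config and its parameters are hypothetical, and the real files may be produced differently.

```python
# Template mirroring the fts_state_consumer.cfg example on this page.
CONFIG_TEMPLATE = """\
<incoming-broker>
  auth = "x509 cert={cert},key={key}"
</incoming-broker>

outgoing-queue = "path={spool}"
subscribe = "destination={destination}"
reliable = true
heart-beat=true
"""


def render_consumer_config(destination, spool,
                           cert="/home/dboard/.security/usercert.pem",
                           key="/home/dboard/.security/userkey.pem"):
    """Return the text of a stompclt consumer configuration file.

    destination: queue to subscribe to on the broker
    spool:       local directory where consumed messages are stored
    """
    return CONFIG_TEMPLATE.format(cert=cert, key=key, spool=spool,
                                  destination=destination)


if __name__ == "__main__":
    print(render_consumer_config(
        "/queue/Consumer.dashb-fts.transfer.fts_monitoring_state",
        "/opt/dashboard/var/messages/fts_state"))
```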

Basic commands for stompclt

Start a consumer:

stompclt --incoming-broker-uri stomp+ssl://mb202.cern.ch:61123 --conf /opt/dashboard/etc/dashboard-simplevisor/fts_state_consumer.cfg --pidfile /opt/dashboard/var/lock/mb202.cern.ch-fts_state.pid --daemon

Check status of a specific consumer:

stompclt --pidfile /opt/dashboard/var/lock/mb202.cern.ch-fts_state.pid --status

Stop a consumer:

stompclt --pidfile /opt/dashboard/var/lock/mb202.cern.ch-fts_state.pid --quit

List all running consumers:

ps -ef | grep stompclt
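
The start, status and stop commands above differ only in the broker host and the consumer name. The sketch below builds them as argument lists, using only the options shown on this page. The helper names, and the assumption that the configuration file is always named <name>_consumer.cfg, are illustrative and not confirmed beyond the single example above.

```python
LOCK_DIR = "/opt/dashboard/var/lock"
CONF_DIR = "/opt/dashboard/etc/dashboard-simplevisor"


def pidfile(broker, name):
    """Pidfile path following the <broker>-<name>.pid pattern shown above."""
    return "%s/%s-%s.pid" % (LOCK_DIR, broker, name)


def start_command(broker, name, port=61123):
    """Argument list that starts one consumer as a daemon."""
    return ["stompclt",
            "--incoming-broker-uri", "stomp+ssl://%s:%d" % (broker, port),
            # Assumed naming: <name>_consumer.cfg, as in fts_state_consumer.cfg
            "--conf", "%s/%s_consumer.cfg" % (CONF_DIR, name),
            "--pidfile", pidfile(broker, name),
            "--daemon"]


def control_command(broker, name, action):
    """Argument list for --status or --quit on a running consumer."""
    if action not in ("status", "quit"):
        raise ValueError("unsupported action: %r" % action)
    return ["stompclt", "--pidfile", pidfile(broker, name), "--" + action]


if __name__ == "__main__":
    print(" ".join(start_command("mb202.cern.ch", "fts_state")))
    print(" ".join(control_command("mb202.cern.ch", "fts_state", "status")))
```

Returning a list rather than a string lets the command be passed to subprocess without shell quoting.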

Monitoring of stompclt

Stompclt clients are supervised by simplevisor, which checks them every 30 s and silently restarts them if needed.

Simplevisor: supervisor for all stompclt instances

The example in the previous section shows that one stompclt instance is needed per broker and per queue. As a result, many stompclt clients run simultaneously on a single machine, and some occasionally crash. Simplevisor is used to supervise all these instances.
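
To show how quickly the number of instances grows, the sketch below enumerates one instance per (consumer, broker) pair. The consumer and broker names are taken from the simplevisor status output further down this page; the function name consumer_instances is hypothetical.

```python
from itertools import product

# Names as they appear in the simplevisor status output on this page.
CONSUMERS = ["alice_traffic", "alice_file",
             "fts_start", "fts_complete", "fts_state"]
BROKERS = ["mb103.cern.ch", "mb104.cern.ch", "mb108.cern.ch",
           "mb203.cern.ch", "mb202.cern.ch"]


def consumer_instances(consumers, brokers):
    """Return one instance name per (consumer, broker) pair."""
    return ["%s-%s" % (consumer, broker)
            for consumer, broker in product(consumers, brokers)]


if __name__ == "__main__":
    names = consumer_instances(CONSUMERS, BROKERS)
    # 5 consumers x 5 brokers = 25 stompclt instances on one machine
    print(len(names))
```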

The simplevisor configuration file needs to be regenerated every time machines are added to or removed from the LB alias. A helper script has been developed to create the configuration file; it can be customized for ATLAS or CMS. The generated file should then be moved to the proper directory:

cd /opt/dashboard/doc/simplevisor/
python conf-generator.py <vo>
mv consumer-simplevisor.cfg /opt/dashboard/etc/dashboard-simplevisor/

Basic commands for simplevisor

Start simplevisor:

simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg --daemon start

Check status of simplevisor:

simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg check
dashb-supervisor: OK, as expected
  alice_traffic: OK, as expected
    alice_traffic-mb103.cern.ch: OK, running, as expected
    alice_traffic-mb104.cern.ch: OK, running, as expected
    alice_traffic-mb108.cern.ch: OK, running, as expected
    alice_traffic-mb203.cern.ch: OK, running, as expected
    alice_traffic-mb202.cern.ch: OK, running, as expected
  alice_file: OK, as expected
    alice_file-mb103.cern.ch: OK, running, as expected
    alice_file-mb104.cern.ch: OK, running, as expected
    alice_file-mb108.cern.ch: OK, running, as expected
    alice_file-mb203.cern.ch: OK, running, as expected
    alice_file-mb202.cern.ch: OK, running, as expected
  fts_start: OK, as expected
    fts_start-mb103.cern.ch: OK, running, as expected
    fts_start-mb104.cern.ch: OK, running, as expected
    fts_start-mb108.cern.ch: OK, running, as expected
    fts_start-mb203.cern.ch: OK, running, as expected
    fts_start-mb202.cern.ch: OK, running, as expected
  fts_complete: OK, as expected
    fts_complete-mb103.cern.ch: OK, running, as expected
    fts_complete-mb104.cern.ch: OK, running, as expected
    fts_complete-mb108.cern.ch: OK, running, as expected
    fts_complete-mb203.cern.ch: OK, running, as expected
    fts_complete-mb202.cern.ch: OK, running, as expected
  fts_state: OK, as expected
    fts_state-mb103.cern.ch: OK, running, as expected
    fts_state-mb104.cern.ch: OK, running, as expected
    fts_state-mb108.cern.ch: OK, running, as expected
    fts_state-mb203.cern.ch: OK, running, as expected
    fts_state-mb202.cern.ch: OK, running, as expected
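
With 25 entries, problems are easy to miss by eye. The sketch below scans output in the format shown above and returns every entry whose status does not start with "OK". The wording simplevisor prints for a failing entry is not shown on this page, so "anything not starting with OK" is an assumption; the function name unexpected_entries is hypothetical.

```python
def unexpected_entries(check_output):
    """Return the names of entries whose status does not start with 'OK'.

    Each line is expected to look like '<name>: <status>'; lines without
    that separator are ignored.
    """
    problems = []
    for line in check_output.splitlines():
        name, sep, status = line.strip().partition(": ")
        if sep and not status.startswith("OK"):
            problems.append(name)
    return problems


# A few lines copied from the status output above.
SAMPLE = """\
dashb-supervisor: OK, as expected
  fts_state: OK, as expected
    fts_state-mb202.cern.ch: OK, running, as expected
"""

print(unexpected_entries(SAMPLE))  # prints [] because every entry is OK
```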

Stop simplevisor:

simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg stop

Monitoring of simplevisor

Every 30 minutes, a cron job checks that simplevisor is running and that all the underlying services are in the expected state. If a problem is found, an automatic restart is triggered and a notification is sent.
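
The logic of such a check can be sketched as follows. This is not the cron script actually deployed, which is not shown on this page. It assumes that "simplevisor ... check" exits with a non-zero status when something is not in the expected state, and it restarts by running the stop and start commands given above. The function name check_and_recover and the run/notify parameters are hypothetical.

```python
import subprocess

CONF = "/opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg"


def check_and_recover(run=subprocess.call, notify=print):
    """Run 'simplevisor check'; on failure, notify and restart simplevisor.

    run:    callable taking an argument list and returning an exit status
    notify: callable taking a message (assumed stand-in for the alarm mail)
    Returns True if the check passed, False if a recovery was attempted.
    """
    base = ["simplevisor", "--conf", CONF]
    # Assumption: exit status 0 means everything is as expected.
    if run(base + ["check"]) == 0:
        return True
    notify("simplevisor check failed, restarting")
    run(base + ["stop"])
    run(base + ["--daemon", "start"])
    return False
```

Passing run and notify as parameters keeps the logic testable without touching the real services.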

Collectors: sending messages to the DB

The messages stored temporarily in the local queue must then be sent to the DB. This is done by the collectors.

Basic commands for the collector

/opt/dashboard/bin/dashb-agent-list
/opt/dashboard/bin/dashb-agent-start fts.collectors/alice.collectors/aso_collectors
/opt/dashboard/bin/dashb-agent-restart fts.collectors/alice.collectors/aso_collectors
/opt/dashboard/bin/dashb-agent-stop fts.collectors/alice.collectors/aso_collectors
/opt/dashboard/bin/dashb-agent-status fts.collectors/alice.collectors/aso_collectors

Monitoring of the collectors

A cron job runs every 5 min to check that the collectors are running as expected and restarts them if needed. Another cron job for the service runs every 30 sec to check the services and restart them in case of problems.

Database

Contact: hassen.riahi@cern.ch/0021620208020

Monitoring & Automatic recovery procedure

To optimize reaction and recovery times, various procedures have been put in place. Depending on the impact, two kinds of action can be taken: N (notification) or R (service restart). The following table describes the strategy implemented on each machine.

| Service and machine | Consumer crash | Simplevisor crash | Collector crash | httpd crash | Local queue stacked | List alarmed |
| FTS dashb-ai-578 | R | N+R | N+R | R | N | Broker: wlcg-transfer-msg@cern.ch. Simplevisor: dashb-fts-alarms@cern.ch |
| ALICE dashb-ai-578 | R | N+R | N+R | R | N | Simplevisor: dashb-fts-alarms@cern.ch |
| FTS dashb-ai-579 | R | N+R | N+R | R | N | Broker: wlcg-transfer-msg@cern.ch. Simplevisor: dashb-fts-alarms@cern.ch |
| ALICE dashb-ai-579 | R | N+R | N+R | R | N | Simplevisor: dashb-fts-alarms@cern.ch |
| FTS/ALICE dashb-ai-552 | R | N+R | N+R | R | N | Broker: wlcg-transfer-msg@cern.ch. Simplevisor: dashb-fts-alarms@cern.ch |
| ASO dashb-ai-595 | R | N+R | N+R | N/A | N | Broker: aso-msg@cern.ch. Simplevisor: dashb-fts-alarms@cern.ch |
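
For scripting, the strategy table can be held as a lookup structure. The sketch below encodes the table above and returns the actions for a given service, machine and event. The names EVENTS, STRATEGY and actions are hypothetical and only illustrate one possible encoding.

```python
# Event columns of the strategy table, in order.
EVENTS = ("consumer crash", "simplevisor crash", "collector crash",
          "httpd crash", "local queue stacked")

# One row per (service, machine); values are N, R, N+R or N/A.
STRATEGY = {
    ("FTS", "dashb-ai-578"): ("R", "N+R", "N+R", "R", "N"),
    ("ALICE", "dashb-ai-578"): ("R", "N+R", "N+R", "R", "N"),
    ("FTS", "dashb-ai-579"): ("R", "N+R", "N+R", "R", "N"),
    ("ALICE", "dashb-ai-579"): ("R", "N+R", "N+R", "R", "N"),
    ("FTS/ALICE", "dashb-ai-552"): ("R", "N+R", "N+R", "R", "N"),
    ("ASO", "dashb-ai-595"): ("R", "N+R", "N+R", "N/A", "N"),
}


def actions(service, machine, event):
    """Return the list of actions ('N', 'R') for one table cell.

    An 'N/A' cell yields an empty list.
    """
    cell = STRATEGY[(service, machine)][EVENTS.index(event)]
    return [] if cell == "N/A" else cell.split("+")


if __name__ == "__main__":
    print(actions("FTS", "dashb-ai-578", "simplevisor crash"))
```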

-- HassenRiahi - 2014-12-15

Topic revision: r5 - 2014-12-16 - HassenRiahi
 