Support FTS Monitoring Dashboard
This page documents the support procedures for the FTS Monitoring Dashboard.
FTS Infrastructure
All machines are managed by Puppet.
Messaging infrastructure
- Production broker
- Queues
  - FTS
    - /queue/Consumer.dashb-fts.transfer.fts_monitoring_complete
    - /queue/Consumer.dashb-fts.transfer.fts_monitoring_start
    - /queue/Consumer.dashb-fts.transfer.fts_monitoring_state
  - ASO
    - /queue/Consumer.dashb-aso-jobmon.fts.aso
    - /queue/Consumer.dashb-aso-ftsmon.fts.aso
- Topics
  - FTS
    - /topic/transfer.fts_monitoring_state
    - /topic/transfer.fts_monitoring_start
    - /topic/transfer.fts_monitoring_complete
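The FTS destinations listed above appear to follow a naming pattern: the consumer queue is the topic name prefixed with `Consumer.<consumer name>.`. The following is a minimal sketch of that mapping. The pattern is inferred from the names on this page rather than from broker documentation, and the function name `consumer_queue` is hypothetical.

```python
def consumer_queue(consumer: str, topic: str) -> str:
    """Build the consumer queue destination for a topic destination.

    Assumption: queue = /queue/Consumer.<consumer>.<topic name>,
    as suggested by the FTS destinations listed on this page.
    """
    prefix = "/topic/"
    if not topic.startswith(prefix):
        raise ValueError(f"not a topic destination: {topic!r}")
    return f"/queue/Consumer.{consumer}.{topic[len(prefix):]}"


FTS_TOPICS = [
    "/topic/transfer.fts_monitoring_state",
    "/topic/transfer.fts_monitoring_start",
    "/topic/transfer.fts_monitoring_complete",
]

# Queues the dashb-fts consumer would read from, one per topic.
FTS_QUEUES = [consumer_queue("dashb-fts", topic) for topic in FTS_TOPICS]
```

This can help when checking that the `subscribe` destination in a consumer configuration file matches the topic it is meant to drain.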
Stompclt: message consumption
Stompclt is used to consume messages from a single broker and store them on local disk.
The stompclt command should not be run manually; it is handled by simplevisor (see the next section).
All configuration files can be found in: /opt/dashboard/etc/dashboard-simplevisor/
Example of a configuration file:
vim /opt/dashboard/etc/dashboard-simplevisor/fts_state_consumer.cfg
<incoming-broker>
auth = "x509 cert=/home/dboard/.security/usercert.pem,key=/home/dboard/.security/userkey.pem"
</incoming-broker>
outgoing-queue = "path=/opt/dashboard/var/messages/fts_state"
subscribe = "destination=/queue/Consumer.dashb-fts.transfer.fts_monitoring_state"
reliable = true
heart-beat=true
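Before handing a consumer file to simplevisor, it can be useful to check that the expected options are present. The sketch below reads the `key = value` lines and `<block>` sections shown in the example above. It is not a full parser for the configuration format, and the `REQUIRED` list is an assumption made for illustration, not a rule taken from stompclt.

```python
# Options this sketch treats as mandatory (an assumption, not a stompclt rule).
REQUIRED = ["subscribe", "outgoing-queue"]


def parse_consumer_config(text: str) -> dict:
    """Collect options from a consumer file; block options become 'block.key'."""
    options = {}
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("</"):          # closing tag, e.g. </incoming-broker>
            block = None
        elif line.startswith("<"):         # opening tag, e.g. <incoming-broker>
            block = line.strip("<>").strip()
        else:
            # Split on the first '=' only: values may contain '=' themselves.
            key, _, value = line.partition("=")
            name = key.strip()
            if block:
                name = f"{block}.{name}"
            options[name] = value.strip().strip('"')
    return options


def missing_options(options: dict) -> list:
    """Return the required options that are absent from the parsed file."""
    return [name for name in REQUIRED if name not in options]
```

For the example file above, `parse_consumer_config` would store the certificate settings under `incoming-broker.auth` and the queue destination under `subscribe`.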
Basic commands for stompclt
Start a consumer:
stompclt --incoming-broker-uri stomp+ssl://gridmsg202.cern.ch:6162 --conf /opt/dashboard/etc/dashboard-simplevisor/atlas_fax_consumer.cfg --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --daemon
Check status of a specific consumer:
stompclt --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --status
Stop a consumer:
stompclt --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --quit
List all running consumers:
ps -ef | grep stompclt
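The commands above differ only in the broker and the consumer name, so they can be assembled programmatically. The sketch below builds the argument lists using only the options shown above. The file naming (`<consumer>_consumer.cfg`, `<broker>-<consumer>.pid`) and the port are inferred from the examples, and the helper names are hypothetical.

```python
CONF_DIR = "/opt/dashboard/etc/dashboard-simplevisor"
LOCK_DIR = "/opt/dashboard/var/lock"


def pidfile(broker: str, consumer: str) -> str:
    """Pid file path, following the <broker>-<consumer>.pid naming above."""
    return f"{LOCK_DIR}/{broker}-{consumer}.pid"


def start_command(broker: str, consumer: str, port: int = 6162) -> list:
    """Argument list that starts one consumer as a daemon."""
    return [
        "stompclt",
        "--incoming-broker-uri", f"stomp+ssl://{broker}:{port}",
        "--conf", f"{CONF_DIR}/{consumer}_consumer.cfg",
        "--pidfile", pidfile(broker, consumer),
        "--daemon",
    ]


def control_command(broker: str, consumer: str, action: str) -> list:
    """Argument list for the --status or --quit commands shown above."""
    if action not in ("status", "quit"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["stompclt", "--pidfile", pidfile(broker, consumer), f"--{action}"]
```

The resulting list can be passed to `subprocess.run`, but in production these processes should still be left to simplevisor.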
Monitoring of stompclt
Stompclt clients are supervised by simplevisor, which checks them every 30 seconds and silently restarts any that have stopped.
Simplevisor: supervisor for all stompclt instances
As the example in the previous section shows, one stompclt instance is needed per broker and per queue. As a result, many stompclt clients run simultaneously on a single machine, and some occasionally crash. Simplevisor is used to supervise all these instances.
The simplevisor configuration file needs to be regenerated every time machines are added to or removed from the LB alias. A helper script has been developed to create the configuration file; it can be customized for ATLAS or CMS. The generated file should then be copied to the proper directory:
cd /opt/dashboard/doc/simplevisor/
python conf-generator.py <vo>
mv consumer-simplevisor.cfg /opt/dashboard/etc/dashboard-simplevisor/
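To see why the file has to be regenerated when the alias changes, it helps to enumerate the instances it must describe: one per consumer and per broker. The sketch below is not the actual conf-generator.py. It only lists instance names in the `<consumer>-<broker>` form that appears in the `simplevisor check` output on this page, and reports which brokers joined or left. The function names are hypothetical.

```python
from itertools import product


def instance_names(consumers, brokers):
    """One stompclt instance per (consumer, broker) pair."""
    return [f"{consumer}-{broker}" for consumer, broker in product(consumers, brokers)]


def broker_changes(old_brokers, new_brokers):
    """Return (added, removed) brokers between two views of the LB alias."""
    old, new = set(old_brokers), set(new_brokers)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    consumers = ["fts_start", "fts_complete", "fts_state"]
    brokers = ["mb103.cern.ch", "mb104.cern.ch"]
    for name in instance_names(consumers, brokers):
        print(name)
```

Any non-empty result from `broker_changes` means the set of instances has changed and the configuration file is out of date.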
Basic commands for simplevisor
Start simplevisor:
simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg --daemon start
Check status of simplevisor:
simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg check
dashb-supervisor: OK, as expected
alice_traffic: OK, as expected
alice_traffic-mb103.cern.ch: OK, running, as expected
alice_traffic-mb104.cern.ch: OK, running, as expected
alice_traffic-mb108.cern.ch: OK, running, as expected
alice_traffic-mb203.cern.ch: OK, running, as expected
alice_traffic-mb202.cern.ch: OK, running, as expected
alice_file: OK, as expected
alice_file-mb103.cern.ch: OK, running, as expected
alice_file-mb104.cern.ch: OK, running, as expected
alice_file-mb108.cern.ch: OK, running, as expected
alice_file-mb203.cern.ch: OK, running, as expected
alice_file-mb202.cern.ch: OK, running, as expected
fts_start: OK, as expected
fts_start-mb103.cern.ch: OK, running, as expected
fts_start-mb104.cern.ch: OK, running, as expected
fts_start-mb108.cern.ch: OK, running, as expected
fts_start-mb203.cern.ch: OK, running, as expected
fts_start-mb202.cern.ch: OK, running, as expected
fts_complete: OK, as expected
fts_complete-mb103.cern.ch: OK, running, as expected
fts_complete-mb104.cern.ch: OK, running, as expected
fts_complete-mb108.cern.ch: OK, running, as expected
fts_complete-mb203.cern.ch: OK, running, as expected
fts_complete-mb202.cern.ch: OK, running, as expected
fts_state: OK, as expected
fts_state-mb103.cern.ch: OK, running, as expected
fts_state-mb104.cern.ch: OK, running, as expected
fts_state-mb108.cern.ch: OK, running, as expected
fts_state-mb203.cern.ch: OK, running, as expected
fts_state-mb202.cern.ch: OK, running, as expected
Stop simplevisor:
simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg stop
Monitoring of simplevisor
Every 30 minutes, a cron job checks that simplevisor is running and that all the underlying services are in the expected state. If any problem is found, an automatic restart is triggered and a notification is sent.
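The periodic check can be approximated by scanning the output of the `simplevisor ... check` command. The sketch below is not the deployed cron script. It assumes each line has the `name: status` form shown in the sample output above, and it treats any status that does not start with `OK` as a problem. The exact wording of failure lines is not documented on this page, so that rule is an assumption.

```python
def failing_entries(check_output: str) -> list:
    """Return the names of entries whose status does not start with 'OK'."""
    problems = []
    for line in check_output.splitlines():
        name, separator, status = line.strip().partition(": ")
        if separator and not status.startswith("OK"):
            problems.append(name)
    return problems


def needs_recovery(check_output: str) -> bool:
    """True when at least one supervised entry is not in the expected state."""
    return bool(failing_entries(check_output))
```

A wrapper could run the check command with `subprocess.run(..., capture_output=True, text=True)`, pass its standard output to `needs_recovery`, and then trigger the restart and the notification.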
Collector: sending messages to the database
All the messages temporarily stored in the local queue then need to be sent to the DB.
Basic commands for the collector
/opt/dashboard/bin/dashb-agent-list
/opt/dashboard/bin/dashb-agent-start fts.collectors
/opt/dashboard/bin/dashb-agent-restart fts.collectors
/opt/dashboard/bin/dashb-agent-stop fts.collectors
/opt/dashboard/bin/dashb-agent-status fts.collectors
Monitoring of the collectors
Every 30 minutes, a cron job checks whether the collectors are running as expected and restarts them if needed.
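A growing backlog in the local queue directory is one sign that the collectors are not keeping up. The sketch below counts the files under a queue directory such as /opt/dashboard/var/messages/fts_state. It assumes one file per pending message, which may overcount if the queue keeps temporary or lock files, and the `threshold` default is purely illustrative.

```python
import os


def pending_messages(queue_dir: str) -> int:
    """Count regular files below queue_dir (assumed: one file per message)."""
    count = 0
    for _root, _dirs, files in os.walk(queue_dir):
        count += len(files)
    return count


def is_stacked(queue_dir: str, threshold: int = 10000) -> bool:
    """True when the backlog exceeds the (hypothetical) alarm threshold."""
    return pending_messages(queue_dir) > threshold
```

In a cron job, a `True` result would map to the notification action rather than a restart.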
Database
Monitoring & Automatic recovery procedure
To optimize the reaction and recovery times, several procedures have been set up. Depending on the impact, two kinds of action can be taken: N (notification) or R (service restart). The following table describes the strategy implemented on each machine.
| Service | Machine | Consumer crash | Simplevisor crash | Collector crash | httpd crash | Local queue stacked | List alarmed |
| FTS | dashb-ai-578 | R | N+R | N+R | R | N | |
| FTS | dashb-ai-579 | R | N+R | N+R | R | N | |
| FTS | dashb-ai-552 | R | N+R | N+R | R | N | |
| ASO | dashb-ai-595 | R | N+R | N+R | N/A | N | |
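The strategy in the table can also be written as a lookup, which makes it easy to ask what should happen for a given machine and failure. The sketch below reads each row as service, machine, then one cell per failure type in header order. The event keys and the `actions` function are hypothetical names, and the empty "list alarmed" column is left out.

```python
# Failure types in the order of the table header (first five columns).
EVENTS = [
    "consumer crash",
    "simplevisor crash",
    "collector crash",
    "httpd crash",
    "local queue stacked",
]

# One row per machine, codes copied from the table.
ROWS = {
    "dashb-ai-578": ["R", "N+R", "N+R", "R", "N"],
    "dashb-ai-579": ["R", "N+R", "N+R", "R", "N"],
    "dashb-ai-552": ["R", "N+R", "N+R", "R", "N"],
    "dashb-ai-595": ["R", "N+R", "N+R", "N/A", "N"],
}

CODES = {"N": "notify", "R": "restart"}


def actions(machine: str, event: str) -> set:
    """Return the set of actions ('notify', 'restart') for a failure."""
    code = ROWS[machine][EVENTS.index(event)]
    if code == "N/A":
        return set()
    return {CODES[part] for part in code.split("+")}
```

For example, a simplevisor crash on dashb-ai-578 maps to both a notification and a restart, while an httpd crash on the ASO machine maps to no action.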
-- HassenRiahi - 2014-12-15