Lemon for the CMS SAM client
Abnormal situations to be trapped
Test submission has stopped (i.e. the cron scripts do not run)
Detection
At least one of the SAM log files is older than 2 hours.
file.sslmtime gives the age of a file in seconds, or -1 if the file does not exist. A separate metric is needed for each log file.
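The check described above can be illustrated with a minimal Python sketch. It is not the Lemon sensor itself: the function names are hypothetical, and treating a missing log file (-1) as an abnormal situation is an assumption, since the rule above only mentions files older than 2 hours.

```python
import os
import time

MAX_AGE_SECONDS = 2 * 60 * 60  # the 2-hour threshold from the detection rule


def file_age_seconds(path, now=None):
    """Return the age of `path` in seconds, or -1 if it does not exist."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return -1
    return int(now - mtime)


def is_stale(age):
    """True if the age exceeds the threshold.

    Assumption: a missing file (age -1) is also reported as abnormal.
    """
    return age < 0 or age > MAX_AGE_SECONDS
```

One such check would be evaluated per log file, mirroring the one-metric-per-file requirement.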
Action
Raise an alarm which triggers a call to the CMS SAM support.
Configuration
The file is customization/cms/jrobots_sam/sam_metrics.tpl.
The relevant metrics are 4120, 4121 and 4122 and the exception is 30691.
One of the cron scripts results in a fatal error (e.g. cannot create a proxy)
Detection
A FATAL error was logged in the SAM log files in the last 2 hours.
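As an illustration of this detection rule, the following Python sketch selects FATAL lines logged in the last 2 hours. The log line layout (a "YYYY-MM-DD HH:MM:SS" timestamp at the start of each line) and the function name are assumptions made for the example; the actual SAM log format may differ.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=2)
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed; adapt to the real log format


def recent_fatal_lines(lines, now):
    """Return the lines containing 'FATAL' that were logged within WINDOW of `now`."""
    hits = []
    for line in lines:
        if "FATAL" not in line:
            continue
        try:
            stamp = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if timedelta(0) <= now - stamp <= WINDOW:
            hits.append(line)
    return hits
```

An alarm would be raised whenever the returned list is non-empty for any of the log files.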
Action
Raise an alarm which triggers a call to the CMS SAM support.
Configuration
The file is customization/cms/jrobots_sam/sam_metrics.tpl.
The relevant metrics are 5370, 5371 and 5372 and the exception is 30692.
Publication to the SAM database fails
Detection
The number of error messages in the last 2 hours complaining about database problems is larger than 20. See what the IT-GT SAM team does.
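Since the configuration for this case is not yet defined, the threshold logic can only be sketched. The Python example below assumes the log has already been parsed into (timestamp, message) pairs and uses a hypothetical pattern, "database", to recognise the relevant messages; both are illustrative choices, not the real setup.

```python
import re
from datetime import datetime, timedelta

WINDOW = timedelta(hours=2)
MAX_DB_ERRORS = 20  # threshold from the detection rule
DB_ERROR_PATTERN = re.compile(r"database", re.IGNORECASE)  # hypothetical pattern


def count_db_errors(entries, now):
    """Count entries within WINDOW of `now` whose message matches the pattern.

    `entries` is an iterable of (timestamp, message) pairs.
    """
    return sum(
        1
        for stamp, message in entries
        if timedelta(0) <= now - stamp <= WINDOW and DB_ERROR_PATTERN.search(message)
    )


def publication_failing(entries, now):
    """True when the count is strictly larger than MAX_DB_ERRORS."""
    return count_db_errors(entries, now) > MAX_DB_ERRORS
```

The comparison is strict, so exactly 20 matching messages in the window would not raise the alarm.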
Action
Raise an alarm which triggers a call to the CMS SAM support.
Configuration
To be done.
Known issues
- The actuator runs as root, so the e-mail is sent by root@vocms36.cern.ch, which the e-group rejects. My e-mail address is being used temporarily.
-- AndreaSciaba - 15-Mar-2010