Downtime handling
Goal
Keep track of the downtime information of the different sites. The system should be able to distinguish between scheduled and unscheduled downtimes, the severity of each downtime (at risk, moderate, outage), and the services affected.
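As a rough illustration of the information listed in the goal, a downtime record could be modelled as below. This is only a sketch: the class and field names (`Downtime`, `Severity`, `scheduled`, `services`) are assumptions made for this example and are not taken from the SSB code or from any of the information sources.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List


class Severity(Enum):
    """Severity levels named in the goal above."""
    AT_RISK = "at risk"
    MODERATE = "moderate"
    OUTAGE = "outage"


@dataclass
class Downtime:
    """Hypothetical record for one announced downtime."""
    site: str                 # site name as published by the source
    start: datetime
    end: datetime
    scheduled: bool           # False for unscheduled downtimes
    severity: Severity
    services: List[str] = field(default_factory=list)  # affected services

    def is_active(self, when: datetime) -> bool:
        """Return True if the downtime covers the given instant."""
        return self.start <= when < self.end
```

A record with `scheduled=False` and `severity=Severity.OUTAGE` would then represent an unscheduled outage, which is the kind of entry currently missing from the SSB.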
Current problem
CMS detected that downtimes of American sites were not displayed correctly in the Dashboard Site Status Board (SSB). In addition, all the 'unscheduled' downtimes are missing.
Background information
Sources of information:
Site administrators announce their downtimes in one of these two places:
Intermediate aggregators
| Application | Notes | GOCDB | OIM | VO info | VO defined site names |
| CIC | Provides RSS feed | | | | |
| MyOSG | Provides XML, CSV, iGoogle, mobile .. formats. Does not include the hostname of the services | | | | |
| SAM DB | Possibility to combine with results of SAM tests (selecting only services tested by the VO). The OIM part has been missing since 30th June (Savannah ticket). Requires a direct DB connection for more advanced queries. Only scheduled outages are visible. It will be deprecated in favour of ATP | | | | |
| Aggregated Topology Provider (ATP) | It will be the replacement of SAM DB. OIM seems to be missing right now. Under development | | | | |
- VO defined site names: Names that the VO has specified for the sites (e.g. T1_IT_CNAF, or LCG.Oxford.uk)
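A source that does not provide the VO defined site names leaves the conversion to the consumer. A minimal sketch of such a lookup is shown here, assuming a hand-maintained table; the grid site names `GRID-SITE-A` and `GRID-SITE-B` are placeholders invented for the example, and the function name `to_vo_site_name` is hypothetical.

```python
from typing import Optional

# Hypothetical lookup table: VO -> {grid site name -> VO defined site name}.
# The grid site names are placeholders, not real GOCDB or OIM entries.
VO_SITE_NAMES = {
    "cms": {"GRID-SITE-A": "T1_IT_CNAF"},
    "lhcb": {"GRID-SITE-B": "LCG.Oxford.uk"},
}


def to_vo_site_name(vo: str, grid_site: str) -> Optional[str]:
    """Return the VO defined name for a grid site, or None if it is unmapped."""
    return VO_SITE_NAMES.get(vo.lower(), {}).get(grid_site)
```

Returning `None` for unmapped sites, rather than raising, lets the caller decide whether to skip the downtime or report the missing mapping.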
Experiment specific tools
This list is not exhaustive:
Detailed description of current CMS problem
- CMS detected missing downtimes in the SSB (ticket)
- SSB developers tracked the problem to the SAM DB (ticket)
- SAM DB developers tracked it to the OSG publisher (tickets still open)
How to solve the current CMS problem in SSB
Several alternatives:
- Wait until the OSG publisher problem in SAM is solved
- Wait until ATP is fully operational
- Reuse anything developed by the other experiments?
- Collect the information from other sources
- Being implemented as we speak (see SSB devel).
- Are we reinventing the wheel?
- Instead of SAM DB, go directly to GOCDB and MyOSG
- We have to do the site naming conversion ourselves
- More difficult to combine with 'critical' CMS services
- We get 'unscheduled', and 'at risk' downtimes
- First prototype already in place
- We have to implement the logic to distinguish between downtimes that affect all the instances of one service and those that affect only some (e.g. if a site has a CE in maintenance, we check whether any other CE is still working)
- For GOCDB and MyOSG developers: the possibility of getting all the downtimes that have been modified since time X would make our life easier
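The distinction between downtimes that affect all the instances of a service and those that affect only some can be sketched as follows. This is an illustration under assumptions, not the SSB prototype: the function name, the `(service_type, hostname)` input shape, and the example hostnames are all invented here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def service_types_fully_down(
    site_services: Iterable[Tuple[str, str]],  # (service_type, hostname) pairs of one site
    hosts_in_downtime: Set[str],               # hostnames currently in downtime
) -> Dict[str, bool]:
    """Map each service type to True when every one of its instances is in downtime."""
    hosts_by_type = defaultdict(set)
    for service_type, hostname in site_services:
        hosts_by_type[service_type].add(hostname)
    # A service type is fully down only if all its hosts are in the downtime set.
    return {
        service_type: hosts <= hosts_in_downtime
        for service_type, hosts in hosts_by_type.items()
    }


# Example: one of two CEs is in maintenance, the only SRM is down.
services = [
    ("CE", "ce01.example.org"),
    ("CE", "ce02.example.org"),
    ("SRM", "se01.example.org"),
]
down = {"ce01.example.org", "se01.example.org"}
status = service_types_fully_down(services, down)
# status == {"CE": False, "SRM": True}
```

In this example the site still has a working CE, so only the SRM would be reported as unavailable.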
Action
For the time being (and until ATP is fully functional), the SSB will provide a general solution that can be used by any experiment.
--
PabloSaiz - 25-Aug-2010