Downtime handling


Keep track of the downtime information of the different sites. The system should be able to distinguish between scheduled and unscheduled downtimes, the severity of the downtime (at risk, moderate, outage), and the services affected.
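
As a rough illustration of the information listed above, a downtime record could be modelled as follows. This is only a sketch: the class and field names (Downtime, Severity, services, is_active) are assumptions made for this example and do not come from any of the tools discussed on this page, and the sample values are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Tuple


class Severity(Enum):
    """Severity levels named in the requirement above."""
    AT_RISK = "at risk"
    MODERATE = "moderate"
    OUTAGE = "outage"


@dataclass(frozen=True)
class Downtime:
    """One downtime announcement (hypothetical structure)."""
    site: str                       # site name as published by the source
    scheduled: bool                 # False for unscheduled downtimes
    severity: Severity
    start: datetime
    end: datetime
    services: Tuple[str, ...] = ()  # affected services or hostnames

    def is_active(self, when: datetime) -> bool:
        """True if the downtime covers the given instant (end excluded)."""
        return self.start <= when < self.end


# Invented sample values, for illustration only.
example = Downtime(
    site="T1_IT_CNAF",
    scheduled=False,
    severity=Severity.OUTAGE,
    start=datetime(2010, 8, 25, 8, 0, tzinfo=timezone.utc),
    end=datetime(2010, 8, 25, 12, 0, tzinfo=timezone.utc),
    services=("CE",),
)
```

Keeping the scheduled flag and the severity as separate fields means an unscheduled 'at risk' downtime and a scheduled outage remain distinguishable after aggregation.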

Current problem

CMS detected that downtimes of American sites were not displayed correctly in the Dashboard Site Status Board (SSB). In addition, all the 'unscheduled' downtimes are missing.

Background information

Sources of information:

Site administrators announce their downtimes in one of these two places:

  • OSG Information Management System (OIM): collects information about American sites
  • GOCDB: collects information about WLCG sites (including some American ones)
    • Does not contain VO information

Intermediate aggregators

| Application | Notes | GOCDB | OIM | VO info | VO defined site names |
| CIC | Provides RSS feed | Yes | No | Yes | No |
| MyOSG | Provides XML, CSV, iGoogle, mobile, ... formats. Does not include the hostname of the services | No | Yes | Yes | No |
| SAM DB | Possibility to combine with results of SAM tests (selecting only services tested by the VO). The OIM part is missing since 30th June (Savannah ticket). Requires direct DB connection for more advanced queries. Only scheduled outages are visible. It will become deprecated in favour of ATP | Yes | Yes | Yes | No |
| Aggregated Topology Provider (ATP) | It will be the replacement of SAM DB. OIM seems to be missing right now. Under development | Yes | Yes | Yes | Yes |

  • VO defined site names: Names that the VO have specified for the sites (e.g. T1_IT_CNAF, or )
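
Where a source does not provide VO defined site names, the conversion has to happen on the consumer side. A minimal sketch of such a lookup is shown below. The table contents are placeholders: "EXAMPLE-GOCDB-SITE" is a made-up source name, and SITE_NAME_MAP and to_vo_site_name are hypothetical names chosen for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping: (source, site name in that source) -> VO defined name.
# The single entry is a placeholder, not real topology data.
SITE_NAME_MAP: Dict[Tuple[str, str], str] = {
    ("GOCDB", "EXAMPLE-GOCDB-SITE"): "T1_IT_CNAF",
}


def to_vo_site_name(source: str, source_name: str) -> Optional[str]:
    """Return the VO defined site name, or None if the site is not mapped."""
    return SITE_NAME_MAP.get((source, source_name))
```

Returning None for unmapped sites lets the caller report them explicitly instead of silently dropping their downtimes.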

Experiment specific tools

This list is not exhaustive:

| Experiment | Tool | Source |
| ALICE | Mailing list | CIC |
| ATLAS | Google calendar | Parsing GOCDB and MyOSG pages |
| CMS | Google calendar | Parsing GOCDB and MyOSG, and combining with service info from the Dashboard |
| CMS | CMS Site Status Board | SAM DB combined with Dashboard DB (investigating alternative sources) |
| LHCb | Google calendar | Accessing GOCDB database. This application will become deprecated |
| LHCb | Resource Service Status (under development) | It will combine downtime with other metrics |

Detailed description of current CMS problem

  • CMS detected missing downtimes in the SSB (ticket)
  • The SSB developers tracked the problem to the SAM DB (ticket)
  • The SAM DB team tracked it to the OSG publisher (tickets still open)

How to solve the current CMS problem in SSB

Several alternatives:

  • Wait until the OSG publisher problem in SAM is solved
  • Wait until ATP is fully operational
  • Reuse anything developed by the other experiments?
  • Collect the information from other sources
    • Being implemented as we speak (see SSB devel).
    • Are we reinventing the wheel?
    • Instead of SAM DB, go directly to GOCDB and MyOSG
      • we have to do the site naming conversion ourselves
      • More difficult to combine with 'critical' CMS services
      • We get 'unscheduled', and 'at risk' downtimes
      • First prototype already in place
      • We have to implement the logic to distinguish between downtimes that affect all the instances of a service and those that affect only some (e.g. if a site has a CE in maintenance, we check whether any other CE is still working)
      • For GOCDB and MyOSG developers: the possibility of getting all the downtimes that have been modified since time X would make our life easier
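
The check mentioned in the list above (does a downtime affect every instance of a service at a site, or only some of them?) could look roughly like this. It is a sketch, not the SSB implementation: the function name, the status strings and the example hostnames are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def service_status(instances: Dict[str, Set[str]],
                   hosts_in_downtime: Iterable[str]) -> Dict[str, str]:
    """Classify each service type at a site.

    instances maps a service type (e.g. 'CE') to the hostnames providing it.
    hosts_in_downtime lists the hostnames currently declared in downtime.
    """
    down = set(hosts_in_downtime)
    status = {}
    for service, hosts in instances.items():
        affected = hosts & down
        if not affected:
            status[service] = "ok"
        elif affected == hosts:
            # No working instance is left for this service.
            status[service] = "all instances down"
        else:
            # At least one instance of the service is still available.
            status[service] = "partially down"
    return status


# Invented hostnames, for illustration only.
site_services = {
    "CE": {"ce01.example.org", "ce02.example.org"},
    "SRM": {"srm.example.org"},
}
result = service_status(site_services, ["ce01.example.org", "srm.example.org"])
```

In this example one CE is in maintenance but another is still working, so only the SRM would count as fully unavailable for the site.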


For the time being (and until ATP is fully functional), the SSB will provide a general solution that can be used by any experiment.

-- PabloSaiz - 25-Aug-2010

Topic revision: r5 - 2010-08-26 - PabloSaiz