Operational Use Cases and Status

HOWTO for CODs

  • New use case:
    When raising a new use case, please describe the context and the proposed solution in the following list. Add an entry in the "Use cases status" table, and add an entry in the handover log so that the new item is raised at the WLCG Weekly Operation meeting.

  • Update:
    Check the status of the GGUS/Savannah ticket of each use case every week, and update the tables accordingly.

  • Closing a use case:
    When a use case is closed, please strike it off the list below by wrapping the use case title in an HTML <strike></strike> tag. Move the use case content from the "Use cases status" table to the "Closed use cases status" table.

Operational use cases list

1 Test nodes status and their operational handling (closed)

From: ROC-NE, getting tickets on nodes dedicated to testing. Related also to ticket GGUS #24441 from ROC-FR last year.

Context: Monitoring flag of a production site "certified" forced to "ON" since last GOC DB release, and the operational handling of test nodes of a production site.

Rationale: When an alarm is raised against a node of a production site that is present in the top-level BDII but not registered in the GOCDB, CODs raise a ticket.

Proposed solution (to be debated at the ROC managers meeting):

  • as a transition step, tickets are raised against sites for that node, advising them to declare it in the GOCDB and put it in scheduled downtime (SD), which turns off its monitoring so that it is no longer SAM tested.
  • the actual solution would be simply to allow such nodes not to be monitored, i.e. to be able to set a "production status" of "test" at the node level in the GOCDB.

Consequences: Alarms following SAM tests should be raised only against nodes that meet one of these conditions:

  • nodes are published in the BDII and are either tagged as "production" (anything except "test") or do not exist in the GOCDB.
  • nodes are not published in the BDII and are tagged as "production" (anything except "test") in the GOCDB.
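
The two conditions above can be sketched as a single predicate. This is only an illustration of the rule as worded here, not how SAM or the GOCDB implement it; the function name and the convention that `None` means "not registered in the GOCDB" are assumptions made for the example.

```python
from typing import Optional


def should_raise_alarm(in_bdii: bool, gocdb_status: Optional[str]) -> bool:
    """Decide whether a SAM failure on a node should become an alarm.

    in_bdii      -- True if the node is published in the BDII.
    gocdb_status -- production status tag of the node in the GOCDB,
                    or None if the node is not registered there
                    (hypothetical convention for this sketch).
    """
    if gocdb_status == "test":
        # Nodes tagged "test" never produce alarms.
        return False
    if in_bdii:
        # Published nodes: production-tagged or absent from the GOCDB.
        return True
    # Unpublished nodes: only when registered with a non-"test" tag.
    return gocdb_status is not None


print(should_raise_alarm(True, None))           # True
print(should_raise_alarm(False, "production"))  # True
print(should_raise_alarm(True, "test"))         # False
print(should_raise_alarm(False, None))          # False
```

Any tag other than "test" is treated as "production", as stated in the list above.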

2 Procedure to let a specific node out of the GOCDB clearly without getting SAM alarms (closed)

From: David Bouvet - COD-FR

Context: A site wants to remove a node, reassign the names of some of its nodes, or introduce the use of aliases.

Rationale: The procedure is unclear. The latest recommendations from the weekly operations meetings were:

  • set a scheduled downtime at the node level in GOCDB
  • unpublish the node from the site BDII
  • set the node monitoring tag to "off" in the GOCDB, which is currently not possible for a node that belongs to a "certified" production site (see use case 1)
  • remove the node from GOCDB, which triggers a SAM testing phase during the 3-day SAM retention period, with potential alarms.

Proposed solution (to be debated at the ROC managers meeting):

  • introduce a certification status in GOCDB that allows a node to be "closed" before removal
  • fix GOCDB replication so that the retention period in SAM does not trigger alarms or SAM failures.

3 Documentation of the "severity" field of the downtime section: "Outage, Severe, Moderate, At Risk" (closed)

From: Rolf Rumler - ROC-FR

Context: How to specify this field?

Rationale: The procedure is unclear on how to set this parameter and on how it could be used in Gridview statistics.

Proposed solution:

  • specify the conditions under which Gridview will take into account such a parameter
  • add an explanation field in the GOCDB interface giving the definition agreed on by ROC managers.

4 Unscheduled/scheduled functionality

From: David Bouvet - ROC-FR

Context: How to specify this field?

Rationale: The procedure is unclear on how to set this parameter and on how it could be used in Gridview statistics.

Proposed solution:

  • Gridview does not take it into account for the time being.
  • add an explanation field on the status of this setting in the GOCDB interface.

5 ROC 1st line support role in GOCDB (closed)

From: Marcin Radecki

Context: Introduction of a regional dashboard for ROC 1st line support

Rationale: How to manage the staff of ROC 1st line support?

Proposed solution: To stay consistent with the way roles are handled in the EGEE project, a specific role needs to be added in the GOC DB (cf. GGUS ticket #31128 from Marcin Radecki)

6 Specification of a "core node" in GOCDB for the finalization of the downtimes procedure (closed)

From: Osman Aidel - CIC portal team

Context: Finalization of the downtimes procedures for the operational core tools

Rationale: GGUS, GOCDB, the CIC portal and SAM cannot specify downtimes in case of service interruption.

Proposed solution: A road map has been established to implement the rationale; however, GOCDB development has been stalled since December 2007 (cf. GGUS ticket #31458 from Osman Aidel)

7 Alarms ARCCE (closed)

From: Cyril L'Orphelin - COD-FR

Context: For the past 2 months, CODs have frequently received alarms linked to NDGF sites. These are new "ARCCE" alarms, but this kind of alarm has never been announced as critical, and no announcement has been made for them. What should CODs do with these alarms?

Proposed solution: It is not a problem to raise tickets for these alarms, but an official announcement is necessary.

8 Synchronization of SAM DB (closed)

From: Cyril L'Orphelin - COD-FR

Context: As described in point 2, there is a retention period between the SAM DB and GOCDB. This period is an obstacle in the daily work of CODs. Moreover, the synchronization must be improved for scheduled downtimes.

Rationale: SAM holds a replica of the GOC DB. When a node is added, no delay is observed and the SAM DB gets the new node for the next test pass. But when a node is removed, a 3-day retention period (in the best case) occurs. There is no need for this retention period, as the GOC DB has a field 'ACTIVE' (in table 'PATH') which is set to 'N' when a node is removed.

Proposed solution: Real-time synchronization would improve the quality of SAM tests and of the CODs' work. As a transition step, a flag announcing a downtime is in place on the CIC portal.
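
As an illustration of the rationale, a replica could skip removed nodes by filtering on the 'ACTIVE' field instead of waiting for a retention period. The sketch below uses an in-memory SQLite database as a stand-in; only the table name 'PATH' and the field 'ACTIVE' with value 'N' come from the text above, while the `hostname` column, the value 'Y' and the example hosts are assumptions.

```python
import sqlite3


def active_hostnames(conn: sqlite3.Connection) -> list:
    """Return the hostnames that have not been flagged as removed."""
    rows = conn.execute(
        "SELECT hostname FROM PATH WHERE ACTIVE <> 'N' ORDER BY hostname"
    )
    return [row[0] for row in rows]


# Stand-in for the replicated table; the real schema is not known here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PATH (hostname TEXT PRIMARY KEY, ACTIVE TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO PATH VALUES (?, ?)",
    [("ce01.example.org", "Y"), ("se01.example.org", "N")],
)

print(active_hostnames(conn))  # ['ce01.example.org']
```

The query tests for "not 'N'" rather than for a specific active value, since only the 'N' value is documented here.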

Some tickets opened about this problem:

9 Last escalation step/Site suspension follow-up

From: David Bouvet - COD-FR

Context: The follow-up of the last escalation step by OCC and ROC is not done correctly. When the last step is reached, as stated in the Operational Manual, the ROC should normally discuss in private with its site, and then say at the next Weekly Operation meeting whether the site should be suspended or not. Most of the time, at the Weekly Operation meeting, the ROC says that it has to discuss, and then there is no more news. The site stays in the last escalation step for several weeks.

In Operational Manual: "If no progress is made, COD make sure that OMC is informed of the situation, and the site status is set to “suspended” in GOCDB by COD unless OMC say differently."

Proposed solution: As COD has the rights to suspend a site, if the ROC is not present at the Weekly Operation meeting or has not sent a mail about the problem, COD suspends the site. If the ROC is present and asks for discussion with its site, OCC should put an action on the ROC in the list of actions of the Weekly Operation meeting so that it is followed up at the next meeting. An answer or a suspension by the ROC should come within the next 3 days: as acknowledgement, a mail should be sent to both the OCC and COD mailing lists. If not, the site is suspended by COD after these 3 days.
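
The proposed rule can be written down as a small decision function. This is a sketch under stated assumptions: the 3 days are counted as calendar days from the meeting, the first condition is read as "the ROC neither attended the meeting nor sent a mail", and all names and dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=3)  # "within the next 3 days"


def cod_decision(
    meeting_day: date,
    today: date,
    roc_attended: bool,
    roc_sent_mail: bool,
    ack_mail_on: Optional[date] = None,
) -> str:
    """Return what COD should do for a site in the last escalation step.

    ack_mail_on is the date of the acknowledgement mail sent to the OCC
    and COD mailing lists, or None if no such mail has been received.
    """
    if not roc_attended and not roc_sent_mail:
        # No sign of life from the ROC: COD suspends directly.
        return "suspend"
    deadline = meeting_day + GRACE_PERIOD
    if ack_mail_on is not None and ack_mail_on <= deadline:
        # The ROC answered (or suspended the site itself) in time.
        return "no action"
    if today > deadline:
        # The 3 days have passed without acknowledgement.
        return "suspend"
    return "wait"


meeting = date(2009, 2, 2)
print(cod_decision(meeting, date(2009, 2, 4), True, False))  # wait
print(cod_decision(meeting, date(2009, 2, 6), True, False))  # suspend
```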

Some examples of a "long" last step:

  • GGUS #40521: RU-Phys-SPbSU (1 month and a half)
    • 25/09/2008: last escalation step
    • 06/10/2008: raised at WLCG Ops meeting
    • 06/11/2008: still in last step and not suspended
    • 06/11/2008: Cyril L'Orphelin (COD-FR) sent mail to Maite, Steve and Nick
    • 06/11/2008: Maite sent mail to Russian ROC
    • 06/11/2008: site suspended by Russian ROC
  • GGUS #42015: ITPA-LCG2 (4 weeks)
    • 24/10/2008: last escalation step
    • 27/10/2008: raised at WLCG Ops meeting
    • 03/11/2008: raised again at WLCG Ops meeting
    • 07/11/2008: still in last step and not suspended
    • 10/11/2008: raised again at WLCG Ops meeting
    • 17/11/2008: still in last step and not suspended. ROC North is present at WLCG Ops meeting and will check with site.
    • 18/11/2008: finally fixed by site

Solution: modification of the escalation process, to be reflected in the COD ops manual. The changes from the old version to the new version are given in the attached file (extract from a mail dated 13/01/09 to ROC managers by HC; no feedback received, hence validated on February 3rd).

10 Failing SAM tests due to mw failure

From: Helene CORDIER - COD-FR

Context: From Diana Bosio, on 02/02/09 "The alarm FTS-infosites on fts-t1import.cern.ch which is failing due to the fact that the middleware does not foresee the current production scenario in use at CERN. Developers are aware, a bug has been opened and it will be like this until the bug is fixed. What shall we do with the alarm?"

Proposed Solution: Close the alarm and do not raise tickets until the corresponding GGUS tickets and Savannah bug are closed (cf. GGUS tickets #45163, #44954, #44635 and Savannah bug #46083).

11 Nodes not declared in GOCDB

From: Helene CORDIER - COD-FR

Context: From Diana Bosio, on 02/02/09 handover: A few nodes appeared not to be registered in the GOCDB:

  1. ROC DECH: udo-dcache01.grid.uni-dortmund.de udo-dcache03.grid.uni-dortmund.de udo-ce01.grid.uni-dortmund.de rb-goegrid.local
  2. ROC ITALY: gridit002.pd.infn.it atlas-ce-02.roma1.infn.it

Proposed Solution: Check https://cic.gridops.org/index.php?section=cod&page=comparator and raise tickets against these sites, asking them to register the nodes properly in GOCDB.
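
The kind of check the comparator page supports can be illustrated as a set difference between the published hostnames and the registered ones. This is not the CIC portal's implementation, which is not described here; the function names and the host `ce01.example.org` are made up for the example, and the two INFN hostnames are taken from the list above.

```python
from typing import Iterable, List


def _normalise(hostname: str) -> str:
    """Compare hostnames case-insensitively and without a trailing dot."""
    return hostname.strip().rstrip(".").lower()


def unregistered_nodes(published: Iterable[str], registered: Iterable[str]) -> List[str]:
    """Return published hostnames that have no entry in the registered set."""
    registered_set = {_normalise(h) for h in registered}
    return sorted({_normalise(h) for h in published} - registered_set)


published = ["gridit002.pd.infn.it", "atlas-ce-02.roma1.infn.it", "CE01.example.org"]
registered = ["ce01.example.org"]

print(unregistered_nodes(published, registered))
# ['atlas-ce-02.roma1.infn.it', 'gridit002.pd.infn.it']
```

Each hostname in the result would correspond to one ticket to be raised against the owning site.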

12 Checklist when sites go uncertified in GOCDB

From: Helene CORDIER - COD-FR

Context: From Malgorzata Krakowian, on 27/04/09, c-cod mailing list: an outdated ticket from a site that was revoked on Feb 2nd 2009 and then back in production on April 21st 2009.

In fact, the ticket was not closed before the uncertification, and it appeared again with the re-certification.

Proposed Solution: Addendum to the Sites/ROCs ops manual: ask ROD and/or C-COD to close tickets before the site uncertification, to avoid problems in ticket handling.

13 Closing alarm before uncertifying a site

From: David Bouvet - C-COD-FR

Context: When a site is uncertified, any new SAM alarms raised for that site are not switched off automatically. The alarms age during the uncertified period. Thus, when the site comes back as certified, it appears in the dashboard with old "new" alarms.

Proposed solution: SAM should switch off new alarms of uncertified sites.

As a transition step, before uncertifying a site, RODs should close new alarms for that site.

14 Operational handling of test nodes (production status "off" and monitoring "on")

From: ROC-NE, getting tickets on nodes dedicated to testing.

Context: When site admins want to test a node, they declare it in GOCDB with production status "off" (= test node) and monitoring status "on" so that it is tested by SAM/Nagios. As the monitoring status is "on", alarms (and thus tickets) are raised against the site for that node. SAM/Nagios does not take the node production status into account.

Proposed solution: SAM/Nagios should also filter on the production status of the node (not only on the monitoring status).
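
The difference between the current and the proposed filtering can be sketched as follows. The `Node` record and both function names are hypothetical and only model the two GOCDB flags mentioned above; this is not SAM/Nagios code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Node:
    hostname: str
    production: bool  # GOCDB production status ("on" = True)
    monitored: bool   # GOCDB monitoring status ("on" = True)


def current_alarm_targets(nodes: Iterable[Node]) -> List[str]:
    """Behaviour described in the context: only monitoring status counts."""
    return [n.hostname for n in nodes if n.monitored]


def proposed_alarm_targets(nodes: Iterable[Node]) -> List[str]:
    """Proposed behaviour: a node must be monitored AND in production."""
    return [n.hostname for n in nodes if n.monitored and n.production]


nodes = [
    Node("ce01.example.org", production=True, monitored=True),
    Node("test-ce.example.org", production=False, monitored=True),
]

print(current_alarm_targets(nodes))   # ['ce01.example.org', 'test-ce.example.org']
print(proposed_alarm_targets(nodes))  # ['ce01.example.org']
```

With the proposed filter, the test node is still tested (its monitoring flag stays "on") but no longer generates alarms or tickets.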


Open Use cases

Use cases in red have a higher priority.

Use case title # Raised at Included in ARM action list Involved parties GGUS Savannah Date in Status (date of last change) Comments
Definition/Doc in GOC 4 ARM11 #5 OCC/GOC/Gridview #33175 #6993 21/02/2008 unsolved (25/06/2008) 25/06/2008: GOCDB Advisory Group added this item to its task list in Savannah
14/05/2008: GOCDB development back in the air.
12/03/2008: GOCDB development suspended
Site suspension 9     OCC/ROC  #40521    07/11/2008 Done 17/11/2008: raised at WLCG meeting but no real discussion on this point, discussed at SA1 coordination meeting and modification to the escalation procedure recorded on Feb 3rd 2009
nodes not registered in GOCDB 11 Weekly ops meeting tickets to be opened /comparator to be checked ROC CERN/OCC n/a n/a 02/01/09 02/02/09 recorded here as a suggestion for next version of the ops manual or COD use
site revoked and back in production in a matter of months 12 c-cod mailing lists next version of site/ROC ops manual pole2 n/a n/a 28/04/09 28/04/09 recorded here as a suggestion for next version of the ops manual
Test nodes 14 c-cod mailing lists         19/02/2010 19/02/2010  


Closed use cases status

Use case title # Raised at Included in ARM action list Involved parties GGUS Savannah Date in Status (date of last change) Comments
Core node 6 ARM11 #6 GOC/CIC #31458 #7079 17/01/2008 Solved (14/10/2008) 14/10/2008: Integrated in GOCDB 3.1.1
19/08/2008: Development nearly complete on CIC portal side
25/06/2008: GOCDB Advisory Group added this item to its task list in Savannah.
14/05/2008: GOCDB development back in the air.
21/01/2008: GOCDB development suspended
Test node 1 ARM11 #3 OCC/GOC #33666 #7219 04/03/2008 Solved (14/10/2008) 14/10/2008: Savannah task closed: possibility to put monitoring status off for node not in production
15/04/2008: Feature to be discussed in advisory group
12/03/2008: GOCDB development suspended
1st line support role in GOC 5 ARM11 #6 GOC/CIC portal #31128   24/06/2008 solved and verified (24/06/2008) 24/06/2008: Changes integrated in CIC portal.
12/06/2008: Role added in GOCDB
14/05/2008: GOCDB development back in the air.
18/03/2008: all information provided to implement it
Node removal 2 ARM11 #2, #4 OCC/SAM/Gridview #34233 #7228 17/03/2008 solved (11/08/2008) 11/08/2008: GGUS ticket solved. No more retention period.
25/06/2008: Creation of Savannah ticket #7228 by Gilles for a flag in GOCDB on decommissioned nodes.
26/05/2008: We hope to implement this later this week & test in validation. If all goes well, we should have a working solution in production later next week.
19/05/2008: Following Gridview's answer, asked SAM to remove the retention period and to update the ticket.
05/05/2008: Gridview says that the Gridview DB is fully synchronized with GOCDB. They reassigned the ticket to SAM.
15/04/2008: SAM team answered that it was not notified of the ARM decision... but in the ARM list of actions they seem to be.
Retention period and SD in SAM 8 ARM11 #4 OCC/SAM/Gridview #34233 #7228 17/03/2008 solved (11/08/2008) 11/08/2008: GGUS ticket solved. No more retention period.
25/06/2008: Creation of Savannah ticket #7228 by Gilles for a flag in GOCDB on decommissioned nodes.
26/05/2008: We hope to implement this later this week & test in validation. If all goes well, we should have a working solution in production later next week.
19/05/2008: Following Gridview's answer, asked SAM to remove the retention period and to update the ticket.
05/05/2008: Gridview says that the Gridview DB is fully synchronized with GOCDB. They reassigned the ticket to SAM.
15/04/2008: SAM team answered that it was not notified of the ARM decision... but in the ARM list of actions they seem to be.
Definition/Doc in GOC 3 ARM11 #5 GOC/OCC #33175 #6993 21/02/2008 unsolved (25/06/2008) 06/08/2008: Savannah bug closed -> in production with GOCDB 3.1
25/06/2008: GOCDB Advisory Group added this item to its task list in Savannah
14/05/2008: GOCDB development back in the air.
12/03/2008: GOCDB development suspended
ARCCE 7 COD15   OCC/SAM #33146 #34248 21/02/2008 solved (30/06/2008) 30/06/2008: Bug fixed and in production
16/05/2008: Savannah bug => "Ready for test" status
18/04/2008: ARCCE failed tests raise COD alarms -> Savannah bug updated
06/03/2008: unsolved as Savannah bug opened: SAM said it would be in production around March 20th, but the Savannah bug is still open
Failing SAM tests due to mw bug 10 Weekly ops meeting alarms to be closed ROC CERN/OCC [[https://gus.fzk.de/ws/ticket_info.php?ticket=45163][#45163]], #44954, [[https://gus.fzk.de/ws/ticket_info.php?ticket=44635][#44635]] [[https://savannah.cern.ch/bugs/?46083][#46083]] 02/01/09 02/02/09 recorded here for COD use
Alarm for uncertified site 13 18/12/09

Procedure # Raised at Included in ARM action list Involved parties GGUS Savannah Date in Status
CA update procedure suggestion   ARM11 #8, #9, #10       04/03/2008 ok



-- DavidBouvet - 05 Mar 2008

Topic attachments
Attachment: 2009___escalation_process.doc (Microsoft Word, r1, 35.0 K, 2009-02-03 18:24, HeleneCordier). Comment: change in site suspension procedure
Topic revision: r29 - 2010-02-19 - DavidBouvet