Operational Use Cases and Status
HOWTO for CODs
- New use case:
When raising a new use case, please describe the context and the proposed solution in the following list. Add an entry in the "Use cases status" table, and add an entry in the handover log so that the new item is raised at the WLCG Weekly Operation meeting.
- Update:
Every week, check the status of the GGUS/Savannah ticket of each use case, and update the tables accordingly.
- Closing a use case:
When a use case is closed, please strike it off the list below by wrapping the use case title in HTML <strike></strike> tags. Move the use case content from the "Use cases status" table to the "Closed use cases status" table.
Operational use cases list
1 Test nodes status and their operational handling (closed)
(to status)
From: ROC-NE, getting tickets on nodes dedicated to testing. Also related to GGUS ticket #24441 from ROC-FR last year.
Context:
Since the last GOCDB release, the monitoring flag of a "certified" production site is forced to "ON". This raises the question of the operational handling of test nodes of a production site.
Rationale:
When an alarm is raised against a node of a production site that is present in the top-level BDII but not registered in the GOCDB, COD people raise a ticket.
Proposed solution (to be debated at the ROC managers meeting):
- as a transition step, tickets are raised against the site for that node, advising the site to declare it in the GOCDB and put it in SD so that its monitoring is turned off and it is consequently no longer SAM tested.
- the actual solution would be simply to allow such nodes not to be monitored, i.e. to be able to set a "production status" of "test" at the node level in the GOCDB.
Consequences:
Alarms resulting from SAM tests should only be raised against nodes that meet one of the following conditions:
- nodes are published in the BDII and they are tagged as "production" (anything except "test") or non-existent in the GOCDB.
- nodes are not published in the BDII and they are tagged as "production" (anything except "test") in GOCDB.
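The two conditions above can be read as a single predicate. The following is a minimal sketch only: the function name, its parameters and the use of `None` for "not registered in the GOCDB" are assumptions made for illustration, not part of GOCDB or SAM.

```python
from typing import Optional


def should_raise_alarm(in_bdii: bool, gocdb_status: Optional[str]) -> bool:
    """Decide whether a failed SAM test on a node should raise an alarm.

    in_bdii      -- True if the node is published in the BDII.
    gocdb_status -- the node's production status in the GOCDB, or None
                    if the node does not exist in the GOCDB (assumption:
                    None is our own encoding of "non-existent").
    """
    # "production" means anything except "test".
    is_production = gocdb_status is not None and gocdb_status != "test"
    if in_bdii:
        # Published: alarm if tagged production or unknown to the GOCDB.
        return is_production or gocdb_status is None
    # Not published: alarm only if registered as production.
    return is_production
```

With this reading, a node tagged "test" never raises an alarm, and a node that is neither published nor registered is ignored.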
2 Procedure to take a specific node out of the GOCDB cleanly without getting SAM alarms (closed)
(to status)
From: David Bouvet - COD-FR
Context:
Site wants to remove a node, reassign names of some of its nodes or introduce the use of aliases.
Rationale:
The procedure is unclear. The latest recommendations from the weekly operations meetings were:
- set a scheduled downtime at the node level in GOCDB
- unpublish this node from the site BDII
- set the node monitoring tag to "off" in the GOCDB, which you cannot do at the moment for a node that belongs to a "certified" production site (see use case 1).
- remove node from GOCDB, which will trigger a SAM testing phase during the SAM retention period of 3 days and potential alarms.
Proposed solution (to be debated at the ROC managers meeting):
- introduce a certification status in GOCDB that allows a node to be "closed" before removal
- fix GOCDB replication so that the "retention period" issue in SAM does not trigger alarms or SAM failures.
3 Documentation of the "severity" field of the downtime section: "Outage, Severe, Moderate, At Risk" (closed)
(to status)
From: Rolf Rumler - ROC-FR
Context:
How to specify this field?
Rationale:
It is unclear how to set this parameter and how it could be used in Gridview statistics.
Proposed solution:
- specify the conditions under which Gridview will take into account such a parameter
- add an explanation field in the GOCDB interface giving the definition agreed on by ROC managers.
4 Unscheduled/scheduled functionality
(to status)
From: David Bouvet - ROC-FR
Context:
How to specify this field?
Rationale:
It is unclear how to set this parameter and how it could be used in Gridview statistics.
Proposed solution:
- Gridview does not take it into account for the time being.
- add an explanation field in the GOCDB interface describing the status of this setting
5 ROC 1st line support role in GOCDB (closed)
(to status)
From: Marcin Radecki
Context:
Introduction of a regional dashboard for ROC 1st line support
Rationale:
How to manage the staff of ROC 1st line support?
Proposed solution:
To stay consistent with the way roles are handled in the EGEE project, a specific role needs to be added in the GOCDB (cf. GGUS ticket #31128 from Marcin Radecki)
6 Specification of a "core node" in GOCDB for the finalization of the downtimes procedure (closed)
(to status)
From: Osman Aidel - CIC portal team
Context:
Finalization of the downtimes procedures for the operational core tools
Rationale:
GGUS, GOCDB, the CIC portal and SAM cannot specify downtimes in case of service interruption.
Proposed solution:
A road map has been established to address this; however, GOCDB developments have been stalled since December 2007 (cf. GGUS ticket #31458 from Osman Aidel)
7 Alarms ARCCE (closed)
(to status)
From: Cyril L'Orphelin - COD-FR
Context:
For the past 2 months, CODs have frequently received alarms linked with NDGF sites.
These are new "ARCCE" alarms, but this kind of alarm has never been announced as critical.
No announcement has been made for these alarms. What should CODs do with them?
Proposed solution:
It is not a problem to raise tickets for these alarms, but an official announcement is necessary.
8 Synchronization of SAM DB (closed)
(to status)
From: Cyril L'Orphelin - COD-FR
Context:
As described in point 2, there is a retention period between SAM DB and GOCDB.
This period is an obstacle in the daily work of COD people.
Moreover, the synchronization must be improved for scheduled downtimes.
Rationale:
SAM holds a replica of the GOCDB.
When a node is added, no delay is observed and SAM DB gets the new node for the next test pass. But when a node is removed, a 3-day retention period (in the best case) occurs. There is no need for this retention period, as in the GOCDB there is a field 'ACTIVE' (in table 'PATH') which is set to 'N' when a node is removed.
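As an illustration of the point above, a consumer of the GOCDB replica could drop removed nodes immediately by looking at the 'ACTIVE' field instead of waiting for a retention period. This is a minimal sketch under assumptions: rows are represented as plain dictionaries, and the `hostname` key is hypothetical (only the 'ACTIVE' field and its 'N' value come from the text).

```python
def nodes_to_test(path_rows):
    """Return the hostnames that should still be tested.

    path_rows -- iterable of dicts standing in for rows of the 'PATH'
                 table; the "hostname" key is a hypothetical column name.
    A row whose ACTIVE field is 'N' marks a removed node and is skipped.
    """
    return [row["hostname"] for row in path_rows if row.get("ACTIVE") != "N"]
```

Applied at each test pass, such a filter would make a removed node disappear from testing as soon as the replica is refreshed.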
Proposed solution:
Real-time synchronization would improve the quality of SAM tests and of COD work.
As a transition step, a flag announcing a downtime is in place on the CIC portal.
Some tickets opened about this problem:
9 Last escalation step/Site suspension follow-up
(to status)
From: David Bouvet - COD-FR
Context:
Follow-up of the last escalation step by OCC and ROC is not done correctly. When the last step is reached, as stated in the Operational Manual, the ROC should normally discuss in private with its site, and then report at the next Weekly Operation meeting whether the site should be suspended or not. Most of the time, at the Weekly Operation meeting, the ROC says that it has to discuss, and then there is no further news. The site stays in the last escalation step for several weeks.
In the Operational Manual: "If no progress is made, COD make sure that OMC is informed of the situation, and the site status is set to “suspended” in GOCDB by COD unless OMC say differently."
Proposed solution:
As COD has the right to suspend a site, if the ROC is not present at the Weekly Operation meeting or has not sent a mail about the problem, COD suspends the site.
If the ROC is present and asks for a discussion with its site, OCC should put an action on the ROC in the list of actions of the Weekly Operation meeting so that it is followed up at the next meeting. The answer or the suspension by the ROC should come within the next 3 days: as acknowledgement, a mail should be sent to both the OCC and COD mailing lists. Otherwise, the site is suspended by COD after these 3 days.
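The 3-day rule above can be expressed as a small date check. This is only a sketch of one reading of the proposal: the function name and parameters are hypothetical, and "within the next 3 days" is assumed to mean calendar days counted from the day of the Weekly Operation meeting.

```python
from datetime import date, timedelta

# Assumption: the 3 days are calendar days counted from the meeting day.
ACK_DEADLINE = timedelta(days=3)


def suspension_due(meeting_day: date, today: date, ack_received: bool) -> bool:
    """Return True if COD should now suspend the site.

    meeting_day  -- day the action was put on the ROC at the meeting.
    today        -- day on which the check is made.
    ack_received -- True if the ROC mailed both the OCC and COD lists.
    """
    if ack_received:
        return False
    # Suspension only applies once the 3 days have fully elapsed.
    return today > meeting_day + ACK_DEADLINE
```

For a meeting held on 17/11/2008 with no acknowledgement, the check stays negative up to and including 20/11/2008 and becomes positive from 21/11/2008.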
Some examples of a "long" last step:
- GGUS #40521: RU-Phys-SPbSU (1 month and a half)
  - 25/09/2008: last escalation step
  - 06/10/2008: raised at WLCG Ops meeting
  - 06/11/2008: still in last step and not suspended
  - 06/11/2008: Cyril L'Orphelin (COD-FR) sent mail to Maite, Steve and Nick
  - 06/11/2008: Maite sent mail to Russian ROC
  - 06/11/2008: site suspended by Russian ROC
- GGUS #42015: ITPA-LCG2 (4 weeks)
  - 24/10/2008: last escalation step
  - 27/10/2008: raised at WLCG Ops meeting
  - 03/11/2008: raised again at WLCG Ops meeting
  - 07/11/2008: still in last step and not suspended
  - 10/11/2008: raised again at WLCG Ops meeting
  - 17/11/2008: still in last step and not suspended. ROC North is present at WLCG Ops meeting and will check with site.
  - 18/11/2008: finally fixed by site
Solution: modification of the escalation process, to be reflected in the COD ops manual.
Changes from the old version to the new version: see the attached file, an extract from the mail dated 13/01/09 sent by HC to ROC managers. No feedback was received, hence the change was validated on February 3rd.
10 Failing SAM tests due to mw failure
(to status)
From: Helene CORDIER - COD-FR
Context:
From Diana Bosio, on 02/02/09: "The alarm FTS-infosites on fts-t1import.cern.ch which is failing due to the fact that the middleware does not foresee the current production scenario in use at CERN. Developers are aware, a bug has been opened and it will be like this until the bug is fixed. What shall we do with the alarm?"
Proposed Solution:
Close the alarm and do not raise tickets until the corresponding GGUS tickets and Savannah bug are closed (cf. GGUS tickets #45163, #44954, #44635 and Savannah bug #46083).
11 Nodes not declared in GOCDB
(to status)
From: Helene CORDIER - COD-FR
Context:
From Diana Bosio, on 02/02/09 handover:
A few nodes appeared not to be registered in the GOCDB:
- ROC DECH: udo-dcache01.grid.uni-dortmund.de, udo-dcache03.grid.uni-dortmund.de, udo-ce01.grid.uni-dortmund.de, rb-goegrid.local
- ROC ITALY: gridit002.pd.infn.it, atlas-ce-02.roma1.infn.it
Proposed Solution:
Check https://cic.gridops.org/index.php?section=cod&page=comparator and raise tickets against these sites so that they properly register the nodes in GOCDB.
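The kind of check involved here amounts to a set difference between two host lists. The sketch below assumes the comparison is between hosts seen in the BDII and hosts registered in the GOCDB (as in use case 1); it does not describe how the comparator page itself is implemented, and the function name is hypothetical.

```python
def unregistered_nodes(bdii_hosts, gocdb_hosts):
    """Return hosts published in the BDII but missing from the GOCDB.

    Host names are compared case-insensitively; the result is sorted
    so that it can be pasted into a handover log in a stable order.
    """
    registered = {host.lower() for host in gocdb_hosts}
    return sorted(host for host in set(bdii_hosts) if host.lower() not in registered)
```

Each host in the returned list would be a candidate for a ticket asking the site to register it.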
12 Checklist when sites go uncertified in GOCDB
(to status)
From: Helene CORDIER - COD-FR
Context:
From Malgorzata Krakowian, on 27/04/09, c-cod mailing list:
an outdated ticket from a site that was revoked on Feb 2nd 2009 and then came back into production on April 21st 2009.
In fact, the ticket was not closed before the uncertification, and it reappeared with the re-certification.
Proposed Solution:
Addendum to the Sites/ROCs ops manual: ask ROD and/or C-COD to close tickets before the site is uncertified, to avoid problems in ticket handling.
13 Closing alarm before uncertifying a site
(to status)
From: David Bouvet - C-COD-FR
Context:
When a site is uncertified, if there are new SAM alarms raised for that site, they are not switched off automatically.
Alarms age during the uncertified period. Thus, when the site becomes certified again, it appears in the dashboard with old "new" alarms.
Proposed solution:
SAM should switch off new alarms of uncertified sites.
As a transition step, before uncertifying a site, RODs should close new alarms for that site.
14 Operational handling of test nodes (production status "off" and monitoring "on")
(to status)
From: ROC-NE, getting tickets on nodes dedicated to testing.
Context:
When site admins want to test a node, they declare it in GOCDB with production status "off" (= test node) and monitoring status "on" so that it is tested by SAM/Nagios.
As the monitoring status is "on", alarms (and thus tickets) are raised against the site for that node. SAM/Nagios does not take the node production status into account.
Proposed solution:
SAM/Nagios should filter on production status of the node too (not only on monitoring status)
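The proposed filter can be pictured as follows: a node is still tested whenever monitoring is "on", but it only produces alarms if its production status is "on" as well. This is a minimal sketch; the `Node` class and its field names are hypothetical and do not reflect the SAM/Nagios or GOCDB data model.

```python
from dataclasses import dataclass


@dataclass
class Node:
    hostname: str       # hypothetical field names, for illustration only
    production: bool    # GOCDB production status ("on" = True)
    monitoring: bool    # GOCDB monitoring status ("on" = True)


def alarm_candidates(nodes):
    """Return hostnames for which a failed test may raise an alarm.

    Test nodes (production off, monitoring on) are still tested but
    are filtered out here, so no alarm or ticket is raised for them.
    """
    return [n.hostname for n in nodes if n.monitoring and n.production]
```

Filtering at the alarm stage rather than at the test stage keeps test results available to site admins for their test nodes.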
Open Use cases
Use cases in red have a higher priority.
| Use case title | # | Raised at | Included in ARM action list | Involved parties | GGUS | Savannah | Date in | Status (date of last change) | Comments |
| Definition/Doc in GOC | 4 | ARM11 | #5 | OCC/GOC/Gridview | #33175 | #6993 | 21/02/2008 | unsolved (25/06/2008) | 25/06/2008: GOCDB Advisory Group added this item to its task list in Savannah. 14/05/2008: GOCDB development back in the air. 12/03/2008: GOCDB development suspended. |
| Site suspension | 9 |  |  | OCC/ROC | #40521 |  | 07/11/2008 | Done | 17/11/2008: raised at WLCG meeting but no real discussion on this point; discussed at SA1 coordination meeting and modification to the escalation procedure recorded on Feb 3rd 2009 |
| nodes not registered in GOCDB | 11 | Weekly ops meeting | tickets to be opened / comparator to be checked | ROC CERN/OCC | n/a | n/a | 02/01/09 | 02/02/09 | recorded here as a suggestion for next version of the ops manual or COD use |
| site revoked and back in production in a matter of months | 12 | c-cod mailing lists | next version of site/ROC ops manual | pole2 | n/a | n/a | 28/04/09 | 28/04/09 | recorded here as a suggestion for next version of the ops manual |
| Test nodes | 14 | c-cod mailing lists |  |  |  |  | 19/02/2010 | 19/02/2010 |  |
Closed use cases status
| Use case title | # | Raised at | Included in ARM action list | Involved parties | GGUS | Savannah | Date in | Status (date of last change) | Comments |
| Core node | 6 | ARM11 | #6 | GOC/CIC | #31458 | #7079 | 17/01/2008 | Solved (14/10/2008) | 14/10/2008: Integrated in GOCDB 3.1.1. 19/08/2008: Development nearly complete on CIC portal side. 25/06/2008: GOCDB Advisory Group added this item to its task list in Savannah. 14/05/2008: GOCDB development back in the air. 21/01/2008: GOCDB development suspended. |
| Test node | 1 | ARM11 | #3 | OCC/GOC | #33666 | #7219 | 04/03/2008 | Solved (14/10/2008) | 14/10/2008: Savannah task closed: possibility to put monitoring status off for node not in production. 15/04/2008: Feature to be discussed in advisory group. 12/03/2008: GOCDB development suspended. |
| 1st line support role in GOC | 5 | ARM11 | #6 | GOC/CIC portal | #31128 |  | 24/06/2008 | solved and verified (24/06/2008) | 24/06/2008: Changes integrated in CIC portal. 12/06/2008: Role added in GOCDB. 14/05/2008: GOCDB development back in the air. 18/03/2008: all information provided to implement it. |
| Node removal | 2 | ARM11 | #2, #4 | OCC/SAM/Gridview | #34233 | #7228 | 17/03/2008 | solved (11/08/2008) | 11/08/2008: GGUS ticket solved. No more retention period. 25/06/2008: Creation of Savannah ticket #7228 by Gilles for a flag in GOCDB on decommissioned nodes. 26/05/2008: We hope to implement this later this week & test in validation. If all goes well, we should have a working solution in production later next week. 19/05/2008: Following Gridview answer ask SAM to remove retention period. Ask SAM an update of the ticket. 05/05/2008: Gridview says that Gridview DB is fully synchronized with GOCDB. They reassign the ticket to SAM. 15/04/2008: SAM team answered it was not notified of the ARM decision... but in ARM list of action they seem to be. |
| Retention period and SD in SAM | 8 | ARM11 | #4 | OCC/SAM/Gridview | #34233 | #7228 | 17/03/2008 | solved (11/08/2008) | 11/08/2008: GGUS ticket solved. No more retention period. 25/06/2008: Creation of Savannah ticket #7228 by Gilles for a flag in GOCDB on decommissioned nodes. 26/05/2008: We hope to implement this later this week & test in validation. If all goes well, we should have a working solution in production later next week. 19/05/2008: Following Gridview answer ask SAM to remove retention period. Ask SAM an update of the ticket. 05/05/2008: Gridview says that Gridview DB is fully synchronized with GOCDB. They reassign the ticket to SAM. 15/04/2008: SAM team answered it was not notified of the ARM decision... but in ARM list of action they seem to be. |
| Definition/Doc in GOC | 3 | ARM11 | #5 | GOC/OCC | #33175 | #6993 | 21/02/2008 | unsolved (25/06/2008) | 06/08/2008: Savannah bug closed -> in production with GOCDB 3.1. 25/06/2008: GOCDB Advisory Group added this item to its task list in Savannah. 14/05/2008: GOCDB development back in the air. 12/03/2008: GOCDB development suspended. |
| ARCCE | 7 | COD15 |  | OCC/SAM | #33146 | #34248 | 21/02/2008 | solved (30/06/2008) | 30/06/2008: Bug fixed and in production. 16/05/2008: Savannah bug => "Ready for test" status. 18/04/2008: ARCCE failed tests raise COD alarms -> Savannah bug updated. 06/03/2008: unsolved as Savannah bug opened: SAM said it will be in production around March 20th, but Savannah bug is still open. |
| Failing SAM tests due to mw bug | 10 | Weekly ops meeting | alarms to be closed | ROC CERN/OCC | [[https://gus.fzk.de/ws/ticket_info.php?ticket=45163][#45163]], #44954, [[https://gus.fzk.de/ws/ticket_info.php?ticket=44635][#44635]] | [[https://savannah.cern.ch/bugs/?46083][#46083]] | 02/01/09 | 02/02/09 | recorded here for COD use |
Alarm for uncertified site |
13 |
18/12/09 |
--
DavidBouvet - 05 Mar 2008