Comment | Date | Version | Author |
---|---|---|---|
Complete revision accounting for new NAGIOS tools | March 2010 | 2.0 | Vera Hansper, Malgorzata Krakowian, Peter Gronbech |
Procedures for Regional Operations in regional operations model | 18 Sept 2009 | 1.0 | Vera Hansper, Malgorzata Krakowian, Helene Cordier, Michaela Lechner, Ioannis Liabotis, Peter Gronbech |
Split of manual into ROD and C-COD parts, including revisions | 30 June 2009 | 0.3 | Vera Hansper |
First Draft of manual | 4 March 2009 | 0.2 | Malgorzata Krakowian, Vera Hansper |
First merge of ROD model and COD ops manual | 19 January 2009 | 0.1 | Vera Hansper |
Split from COD OPS manual | 05 January 2009 | 0.0 | Ioannis Liabotis |
Duties of 1st Line Support | Requirements |
---|---|
Receive incident notification from sites in the scope | Mandatory |
Respond to site support requests | Optional – present in “passive” mode |
Contact a site for incidents which are not being tackled | Optional – present in “pro-active” mode |
Assist a site in solving incidents | Mandatory |
Pass information to the ROD team via the dashboard "site notepad" or through an existing GGUS ticket for that site | Mandatory |
Handle incidents less than 24h old | Mandatory |
View incidents older than 24h | Optional |
Modify any GGUS ticket body up to the “solved by ROC” status | Mandatory |
Close incidents for “solved problems” | Mandatory |
Create entries for the knowledge base | Mandatory |
Create tickets to C-COD for core or urgent matters | Optional: submitted through ROD for validation |
Duties of ROD | Requirements |
---|---|
Receive incident notification from sites in the scope | Mandatory (if not handled by 1st Line Support) |
Handle incidents less than 24h old | Mandatory (if not handled by 1st Line Support) |
Create tickets for alarms that are older than 24h and not in an OK state | Mandatory |
View incidents older than 24h | Mandatory |
Escalate tickets to C-COD if necessary: assignment to C-COD can be made directly through the dashboard. | Mandatory |
Propagate actions from C-COD down to sites | Mandatory |
Monitor and update any GGUS tickets up to the “solved by ROC” status (via the dashboard) | Mandatory |
Close incidents for “solved problems” | Mandatory |
Create entries for the knowledge base | Mandatory |
Handle the final state of the incident, i.e. "closed by ROD", once the ROD has verified that the solution provided at the "solved by ROC" level is correct and appropriately documented. | Mandatory |
Put the site in downtime for urgent matters | Optional |
Create tickets to C-COD for core or urgent matters | Mandatory |
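The 24-hour split between the two duty tables can be sketched in a few lines. This is only an illustration of the rules stated above, not part of the dashboard: the function names are hypothetical, the status strings assume the usual Nagios state names, and an alarm that is exactly 24 hours old is treated as not yet "older than 24h" because the tables do not say otherwise.

```python
from datetime import timedelta

# Threshold taken from the duty tables: ROD creates tickets for alarms
# older than 24h that are not in an OK state.
TICKET_THRESHOLD = timedelta(hours=24)


def rod_should_create_ticket(alarm_age: timedelta, status: str) -> bool:
    """Return True when the ROD duty table calls for a ticket.

    Hypothetical helper: `status` is assumed to be a Nagios-style state
    name such as "OK" or "CRITICAL".
    """
    return alarm_age > TICKET_THRESHOLD and status != "OK"


def handler_for_new_incident(first_line_support_present: bool) -> str:
    """Return who handles an incident less than 24h old.

    1st Line Support handles it when the region runs such a team;
    otherwise the duty falls to the ROD.
    """
    return "1st Line Support" if first_line_support_present else "ROD"
```

For example, `rod_should_create_ticket(timedelta(hours=30), "CRITICAL")` is true, while the same alarm in the "OK" state needs no ticket.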
Mail Info:

    From: Regional Operator <rodcontactemail>
    To: <sitecontactemail>, <roccontactemail>
    Cc: <rodcontactemail>, GGUS helpdesk

Ticket info:

    Subject: <problem> at <sitename>

The mail has to contain:

The general template for the mail is as follows:

    Dear Site Admins and ROC Helpdesk,

    We have detected a problem at <sitename>
    ----------------------------------------------------
    *org.sam.CREAMCE-JobSubmit-ops* is failing on : <nodename>
    ----------------------------------------------------
    Failure detected on : <date>

    View failure history and details on NAGIOS portal :
    https://samnagXXX.cern.ch/nagios/cgi-bin/avail.cgi?host=<sitename>&service=org.sam.CREAMCE-JobSubmit-ops&show_log_entries

    View some details about the test description :
    https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#CREAM_CE
    ----------------------------------------------------
    Additional comments or logs for alarm org.sam.CREAMCE-JobSubmit-ops-test-lcgce.uibk.ac.at#4883
    ----------------------------------------------------
    Could you please have a look ?

    Thank you
    <Regionname> - Regional Operator team

    Link to Ticket : <GGUS_url>
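As an illustration of how the placeholders in the template are filled, the following minimal sketch builds such a notification with Python's standard `email` package. It is not the dashboard's implementation: the function name, its parameters and the `helpdesk_email` address are hypothetical, and the body is shortened (the NAGIOS and test description links are left out because the portal host is not given in full).

```python
from email.message import EmailMessage

# Shortened version of the mail template; field names are illustrative.
BODY_TEMPLATE = """\
Dear Site Admins and ROC Helpdesk,

We have detected a problem at {sitename}

*{test}* is failing on : {nodename}

Failure detected on : {detected_on}

Could you please have a look ?

Thank you
{region} - Regional Operator team

Link to Ticket : {ggus_url}
"""


def build_notification(*, problem, sitename, test, nodename, detected_on,
                       region, ggus_url, rod_email, site_email, roc_email,
                       helpdesk_email):
    """Build the first notification mail (hypothetical helper)."""
    msg = EmailMessage()
    # Header layout follows the "Mail Info" and "Ticket info" fields.
    msg["From"] = f"Regional Operator <{rod_email}>"
    msg["To"] = f"{site_email}, {roc_email}"
    msg["Cc"] = f"{rod_email}, {helpdesk_email}"
    msg["Subject"] = f"{problem} at {sitename}"
    msg.set_content(BODY_TEMPLATE.format(
        sitename=sitename, test=test, nodename=nodename,
        detected_on=detected_on, region=region, ggus_url=ggus_url,
    ))
    return msg
```

The returned `EmailMessage` can be inspected with `print(msg)` or handed to `smtplib` for delivery.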
Step [#] | Max. Duration [work days] | Escalation procedure |
---|---|---|
1 | 3 | When an alarm appears on the ROD dashboard (>24 hours old): 1st mail to site admin and ROC |
2 | 3 | 2nd mail to site admin and ROC; at the end of this period escalate to C-COD |
3 | 5 | Ticket escalated to C-COD. Within that week, C-COD should act on the ticket by emailing the ROC, ROD and site, asking for immediate action and requesting representation at the next weekly operations meeting. The discussion may also include site suspension. |
4 | | (If no response is obtained from either the site or ROC) C-COD will discuss the ticket at the FIRST weekly operations meeting and involve the Operation and Coordination Center (OCC) in the ticket |
5 | 5 | Discuss at the SECOND weekly operations meeting and assign the ticket to OCC |
6 | | Where applicable, C-COD will request OCC to approve site suspension |
7 | | C-COD will ask ROC to suspend the site |
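To make the working-day arithmetic in the escalation table concrete, the sketch below computes the latest end date of each timed step from the date of the first mail. It rests on several assumptions that the table does not state: the steps run back to back, step 5 starts as soon as step 3 ends (steps 4, 6 and 7 carry no duration), weekends are the only non-working days, and public holidays are ignored. All function and variable names are hypothetical.

```python
from datetime import date, timedelta

# (step number, maximum duration in work days) from the escalation table;
# steps without a duration are omitted.
TIMED_STEPS = [(1, 3), (2, 3), (3, 5), (5, 5)]


def add_work_days(start: date, days: int) -> date:
    """Return the date `days` working days (Mon-Fri) after `start`."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def escalation_deadlines(first_mail: date) -> dict[int, date]:
    """Map each timed step to the latest date by which it should end."""
    deadlines = {}
    step_end = first_mail
    for step, max_days in TIMED_STEPS:
        # Each step is assumed to begin when the previous one ends.
        step_end = add_work_days(step_end, max_days)
        deadlines[step] = step_end
    return deadlines
```

With a first mail sent on Monday 1 March 2010, this gives 4 March for step 1, 9 March for step 2, 16 March for step 3 and 23 March for step 5.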
"You shall immediately report any known or suspected security breach or misuse of the GRID or GRID credentials to the incident reporting locations specified by the VO and to the relevant credential issuing authorities. The Resource Providers, the VOs and the GRID operators are entitled to regulate and terminate access for administrative, operational and security purposes and you shall immediately comply with their instructions." (Grid Acceptable Use Policy, https://edms.cern.ch/document/428036)
"Sites accept the duty to co-operate with Grid Security Operations and others in investigating and resolving security incidents, and to take responsible action as necessary to safeguard resources during an incident in accordance with the Grid Security Incident Response Policy." (Grid Security Policy, https://edms.cern.ch/document/428008/)
"You shall comply with the Grid incident response procedures and respond promptly to requests from Grid Security Operations. You shall inform users in cases where their access rights have changed." (Virtual Organisation Operations Policy, https://edms.cern.ch/document/853968/)