Technical procedures for candidate ROCs during start-up

The CERN ROD (ROC on duty) Rota

ROC-on-duty (ROD) shifts are a duty of every ROC. They consist of actively monitoring the sites in the region, following the escalation procedure defined in EGEE. In practice this amounts to connecting to the dashboard, reviewing the current alarms and raising tickets to the sites.
The central COD team is responsible for monitoring the activity of the ROD shifts. It will issue warnings, and eventually report the ROC to the OPS meeting, if the ROD fails to review the alarms and assign them to tickets within 3 days.
Hands-on training under an operational ROC is part of the standard procedure for becoming a new ROC, and should be started as soon as possible.

Familiarity with the procedures and the dashboard is essential, and a proven record of a successfully completed training period is a necessary condition for authorizing a new ROC to start operations.

A prerequisite for starting as a ROD at CERN is completing the check-list of registrations to the various systems.
The procedures to follow are fully described in the Operational Procedures document.

Below we briefly sketch two ROD-related topics: the use of the CIC dashboard and the handling of GGUS tickets.

A short (and totally informal) introduction to using the CIC dashboard (the "active" work)

Basically, you log in to the dashboard and select:
  • ROC section on the left tabs,
  • Regional dashboard on the left menu
and check the various tabs in the main section. The most important ones are tickets, alarms, and dashboard.

gives you an overview

gives you the overview of which tickets are expiring

  • review of expired and open tickets

useful to check if there are alarms older than 72 hours or approaching the limit

this is the tab that we use the most

  • review the alarms to see which ones are still relevant (errors)
  • alarms that are no longer relevant (OK) can be switched off
  • for the alarms that are relevant a ticket to the site can be opened by clicking on the magnifying glass on the left of the site name
  • when a ticket is opened you can select the subset of alarms and text that are relevant and add your own comments

If the alarm is very recent, it is sometimes sensible to wait a few hours (if the impact is moderate) in case it is a temporary glitch and not a real problem.
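The alarm handling described above can be summarised as a small decision rule. The following is only a sketch under stated assumptions: the 72-hour limit is taken from the text, while the 3-hour grace period, the `Action` names and the `triage` and `hours_left` helpers are hypothetical and are not part of the CIC dashboard.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum

# The 72-hour limit is the one quoted in the text.
ALARM_LIMIT = timedelta(hours=72)
# "A few hours" is not quantified in the text; 3 hours is a placeholder.
GRACE_PERIOD = timedelta(hours=3)


class Action(Enum):
    SWITCH_OFF = "switch off"
    WAIT = "wait"
    OPEN_TICKET = "open ticket"


def triage(status: str, raised_at: datetime, now: datetime,
           moderate_impact: bool) -> Action:
    """Suggest what the ROD might do with a single alarm."""
    if status == "OK":
        # The alarm is no longer relevant.
        return Action.SWITCH_OFF
    if now - raised_at < GRACE_PERIOD and moderate_impact:
        # Very recent and not severe: it may be a temporary glitch.
        return Action.WAIT
    return Action.OPEN_TICKET


def hours_left(raised_at: datetime, now: datetime) -> float:
    """Hours remaining before the alarm reaches the 72-hour limit."""
    return (ALARM_LIMIT - (now - raised_at)).total_seconds() / 3600


now = datetime(2009, 11, 15, 12, 0, tzinfo=timezone.utc)
print(triage("ERROR", now - timedelta(hours=1), now, moderate_impact=True))
print(hours_left(now - timedelta(hours=70), now))
```

The grace period is deliberately a single constant, so a ROC that prefers a different waiting time only has to change one value.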

Closing the GGUS tickets (the "reactive" work)

What was explained above (going around and chasing the problems at your sites) is half of the work of the ROD. There is also "passive" work to be done: checking the tickets assigned to your ROC in GGUS and addressing them. Note that whenever you open a ticket in the dashboard, a corresponding one is opened in GGUS.

So it's like having a double soul. With one hand (as the ROD) you open tickets to yourself; with the other (as the ROC support) you follow them up in GGUS. Moreover, in the "reactive" role you also have to follow up tickets opened against your ROC by generic users.

Now, GGUS tickets for the CERN ROC are handled in a very special way, using an internal interface with the CERN TT system (PRMS, based on Remedy). Basically, when a ticket is assigned to the CERN ROC in GGUS, a corresponding PRMS ticket is opened and assigned to a special group defined in PRMS (called ROC Cern). Several cases then arise. The main ones are:

a) the ticket is for the CERN-PROD site --> the ROC forwards the ticket to a service manager at CERN --> the GGUS ticket is set automatically to "in progress"
b) the ticket is for an EXTERNAL (non-CERN) site --> the ROC notifies the site via an e-mail feature of PRMS, and the corresponding PRMS ticket stays in the inbox of the CERN ROC (in Remedy) waiting to be closed ("in progress" in GGUS)
c) wrong assignment --> the ticket is redirected in GGUS and closed in PRMS

Cases a, b and c all require the ROD to act.
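To make the three cases easier to keep apart, here is a minimal sketch of the routing decision. It is an illustration only, not the actual GGUS/PRMS interface: the `Ticket` and `Handling` types, the `site_roc` field and the `"CERN"` identifier are hypothetical, and the state strings are descriptive labels rather than real GGUS or PRMS status values.

```python
from dataclasses import dataclass

CERN_ROC = "CERN"        # hypothetical identifier for the CERN ROC
CERN_PROD = "CERN-PROD"  # site name used in the text


@dataclass(frozen=True)
class Ticket:
    site: str      # site the problem is about
    site_roc: str  # ROC that the site belongs to (hypothetical field)


@dataclass(frozen=True)
class Handling:
    case: str
    rod_action: str
    ggus_state: str
    prms_state: str


def route(ticket: Ticket) -> Handling:
    """Map a ticket assigned to the CERN ROC onto case a, b or c."""
    if ticket.site_roc != CERN_ROC:
        # Case c: the ticket should never have reached the CERN ROC.
        return Handling("c", "redirect the ticket in GGUS",
                        "redirected", "closed")
    if ticket.site == CERN_PROD:
        # Case a: handled by a service manager at CERN.
        return Handling("a", "forward to a CERN service manager",
                        "in progress", "forwarded")
    # Case b: an external site within the CERN ROC.
    return Handling("b", "notify the site via the PRMS e-mail feature",
                    "in progress", "open in the CERN ROC inbox")


print(route(Ticket(site="CERN-PROD", site_roc="CERN")).case)
```

Checking for a wrong assignment first keeps the other two branches simple: once the ticket is known to belong to the CERN ROC, the only remaining question is whether the site is CERN-PROD or external.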

The interaction with the PRMS system is complicated (especially for operators outside CERN) and, except as an example, not very relevant to creating a new ROC. However, it is important to let the new ROC practice the "reactive" role as well.

We suggest this general approach: when the candidate ROC's people are on duty, they play both the "active" and the "passive" ROC roles, meaning that they maintain overall control of the dashboard and GGUS. Our team at CERN will instead play the role of CERN-PROD. In practice we will close the circle when needed using PRMS, but we will pretend not to see tickets before they are actually assigned to us (CERN-PROD) or to EXTERNAL sites, an assignment we expect the candidate ROC to make via GGUS (e.g. via the e-mail notification).

In other words, the candidate ROC will be fully responsible for the transition assigned --> in progress in GGUS. We won't take any action in this transition, but we will take care of cleaning up the PRMS system, which unavoidably gets fed by the activity in GGUS.
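As a rough illustration of what this means during a training shift, the sketch below lists the tickets still waiting for the candidate ROC, i.e. those that have not yet left the "assigned" state. The record layout (`id`, `state`) and the ticket identifiers are made up for the example, and the code does not query GGUS.

```python
from typing import Dict, List


def pending_for_candidate(tickets: List[Dict[str, str]]) -> List[str]:
    """Return the IDs of tickets still in 'assigned', in their original order."""
    return [t["id"] for t in tickets if t["state"] == "assigned"]


# Hypothetical snapshot of the tickets assigned to the ROC.
tickets = [
    {"id": "T-1", "state": "assigned"},
    {"id": "T-2", "state": "in progress"},
    {"id": "T-3", "state": "assigned"},
]

print(pending_for_candidate(tickets))  # prints ['T-1', 'T-3']
```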

Install Nagios

Please refer to the procedure from LA ROC

Sign up SLA with sites

Please refer to the procedure from CERN ROC

Topic revision: r2 - 2009-11-15 - unknown