1 Operational Procedures Manual for Central COD (C-COD)

Revision history

Comment | Date | Version | Author
Complete revision accounting for new NAGIOS tools | March 2010 | 2.0 | Vera Hansper, Malgorzata Krakowian, Peter Gronbech
Procedures for C-COD in regional operations model | 18 Sept 2009 | 1.0 | Vera Hansper, Malgorzata Krakowian, Helene Cordier, Michaela Lechner
Split of manual into ROD and C-COD parts, including revisions | 30 June 2009 | 0.3 | Vera Hansper
First Draft of manual | 4 March 2009 | 0.2 | Malgorzata Krakowian, Vera Hansper
First merge of ROD model and COD ops manual | 19 January 2009 | 0.1 | Vera Hansper
Split from COD OPS manual | 05 January 2009 | 0.0 | Ioannis Liabotis

1.1 Preface

The EGEE Operational Procedures Manual (OPS Manual) defines the procedures and responsibilities of the various parties involved in the running of the EGEE infrastructure, namely: the resource centres (also referred to as 'sites'), consisting of local support and site administrators; the staff of the Regional Operations Centres (ROCs), such as the ROC Manager and the ROC support staff; the regional operations team, consisting of the Regional Operator on Duty (ROD) and the 1st Line Support; and the oversight grid monitoring operators (also referred to as 'C-COD'). The OPS Manual is currently structured into three separate documents: one covering the responsibilities of ROCs and sites, one detailing the procedures and responsibilities of the regional operations team and its ROC, and one detailing those of the C-COD team members. To avoid duplicating the same information across multiple documents, and to make the information easier to update, there is a fourth separate document, the OPS 'common sections' manual, containing the sections that are of relevance to the other manuals. Sections of this document are included in the relevant sections of the other three manuals using the twiki INCLUDE mechanism.

The OPS Manual for regional operations related to ROCs and Sites can be found at:

https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalProceduresforROCsAndSites

The OPS Manual for the regional operations team can be found at:

https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalProceduresforROD

The OPS Manual for C-COD can be found at:

https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalProceduresforCCOD

The above procedures can also be found in EDMS at: https://edms.cern.ch/document/840932

Readers of any one of the above manuals are encouraged to also read the other manuals in order to have a complete picture of daily operations within EGEE.

Please verify that the document you are using is the current release. The document can be found from:

This document does not describe future procedures. It describes the tools and procedures used currently to operate the EGEE production service according to the operation model, defined at: https://edms.cern.ch/document/971628

Since the operation of large-scale production grids is not static, we expect this document to change regularly. Major changes will be announced.

  • Change requests come from users of the Operational Procedures Manual: site managers, ROD, C-COD and SA1 operational teams.
  • To request a change in the manual, any interested party should open a ticket via GGUS, specifying the "Type of problem" as "ROD and C-COD Operations".
  • According to the type of change requested the following procedure is followed:
    • Significant changes in procedures and subsequent updates are discussed at C-COD/ARM meetings, which occur quarterly. These requests will be dealt with according to their priority and level of effort required to make the changes.
    • For urgent changes, proposals have to be discussed on the C-COD and ROC-managers mailing lists and agreed at the following Weekly Operations Meeting by a large majority of operational people.
    • When agreed and validated these changes are implemented and the procedure coordinator will release a new version of the document.
    • New versions of the Operations Manual that contain changes in the procedures will be BROADCAST to all sites and ROCs via the broadcasting tool.
    • Small typographical and grammatical changes can be made without discussion or approval and published at short notice. No BROADCAST is sent for these types of changes.

2 Introduction

Staff responsible for the operations of the EGEE grid are divided into the following areas:

  • Operations and Co-ordination Centre - OCC - top level management responsible for all operations.

  • Oversight Team - C-COD (consisting of volunteer ROD representatives).

  • Regional Operations Team(s) - ROD and 1st Line Support (can be one team or two separate teams).

  • Regional Operations Centre - ROC Managers, ROC support staff.

  • Resource Centres (sites) - local support, site admins.

The Regional Operations team is responsible for detecting problems, coordinating the diagnosis, and monitoring the problems through to a resolution. This has to be done in co-operation with the Regional Operations Centres to allow for a hierarchical approach and overall management of tasks. ROCs decide themselves on how to manage Regional Operations and whether they wish to have ROD and 1st Line Support as one team or two separate teams.

Procedures have to be followed according to their formal descriptions to ensure a predictable workflow and to avoid both duplicated effort and cases where no action is taken at all.

2.1 Structure of this manual

This document is based on a description of the operational procedures that were in use by the operations team at CERN in October 2004 and reflects the changes to the structure of EGEE and the outcome of the previous LCG/EGEE Operations workshops.

  • Section 2 : Introduces the manual and describes the duties and roles in the regional operations model.
  • Section 3 : Describes the actions needed for all roles to become operational within SA1; includes contact lists.
  • Section 4 : Describes duties and tasks applicable to C-COD operators.
  • Section 5 : Describes Common Monitoring tasks.
  • Section 6 : Provides a description of the OSCT- Duty Contacts work.
  • Section 7 : Provides a table of references including web addresses.

2.2 Duties of Regional Operations

Regional Operations monitor the sites in their region and react to problems identified by the monitors, either directly or indirectly. They also provide support to sites as needed, add to the knowledge base, and keep oversight bodies informed about non-reactive or non-responsive sites.

2.3 Role of Site Admin

In the scope of Regional Operations, site administrators primarily receive and react to notifications of one or more incidents. They should also provide information in the site notepad, which is available on the dashboard.

2.4 Role of 1st Line Support

A team responsible for supporting the site administrators to solve operational problems. The team is provided by each ROC and requires technical skills for their work. Organization and the presence of such a team is optional. However, if this team does not exist explicitly, duties of 1st Line Support must be absorbed by the Regional Operator.

2.5 Role of Regional Operator (ROD)

A team responsible for solving problems on the infrastructure according to agreed procedures. They ensure that problems are properly recorded and progress according to specified time lines. They ensure that necessary information is available to all parties. The team is provided by each ROC and requires procedural knowledge on the process (rather than technical skills) for their work. However, if the team also encompasses the role of 1st Line Support, then the necessary technical skills will be required.

2.6 Role of Central COD (C-COD)

A small team responsible for coordination of RODs, provided on a global layer. C-COD represents the whole ROD structure at the political level. Support tools developers should interact with C-COD about the tools, especially when issues arise.

3 Getting Started

All new operations team members have to take certain steps to ensure that they can access the operations dashboard. Some options may vary depending on their role; these are explained further in the specific role duties.

The necessary steps for each role will be repeated in the following sub-sections.

There are a few recommendations before starting the actual task. The new operator needs to be familiar with

They should try to understand the errors that appear and identify those that they would not be able to follow up if they were on duty. They can ask the C-COD mailing list or refer to the GOC Wiki pages for help. Typically, they should also read some of the emails that have been sent to the follow-up list.

  • Read the training material, composed of a general presentation of the tool and a "how-to guide", at https://edms.cern.ch/document/1015741 . This material is also available from the Operations Portal.
  • Go to the operations dashboard manual and work through the functionality of the site.
  • Take a look at the C-COD handover logs and previous minutes of Weekly Operations Meetings.
  • Read the gLite User Guide and try submitting a “Hello World” job. This gives valuable background information, and with some hands-on experience it becomes easier to tell user errors from grid service errors.
  • Ask questions. Find people who have done it before or even better join an experienced team for a few days and look over their shoulders while they do their work. Not everything can be written down.
  • Read/Contribute to the Operations Best Practices Wiki, moderated by the C-COD team, at https://twiki.cern.ch/twiki/bin/view/Trash/EGEEEGEEOperationsBestPractices
  • If you spot a specific operational use-case which you do not know how to handle, please fill in the operational use case wiki at https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus, so that the C-COD team can fix this in the operational procedure Manual or in the Best Practices Wiki.

3.1 First time for all roles

3.1.1 First time for C-COD

3.2 Modes of Operations

ROD and 1st Line Support are collectively known as Regional Support, and are organised internally in each ROC. They can be run as two separate teams or be one team that covers both sets of duties. The responsibility of the 1st Line Support team may differ in each ROC. The ROC determines how the work of the 1st Line Support team is organized in conjunction with the ROD team. Suggestions for several modes of operation of 1st Line Support are:

  • minimal - there is almost no 1st Line Support; all responsibility for solving problems is left to the sites, and any project responsibilities that are not covered are taken over by ROD.
  • passive - they react only when a request for support comes from a site.
  • pro-active - they react on requests, look at results and receive notifications from monitors and they contact sites when there is a suspicion of a problem, suggesting solutions.

3.3 Mailing lists and follow-up archive

3.4 Emergency contacts

Currently the GOCDB also handles sites that are not committed to EGEE, but are part of the WLCG as far as EGEE/LCG operations are concerned. These sites are connected to ROCs outside of the EU: LA (Latin America), CA (Canada), IGALC (Latin America and Caribbean Grid Initiative).

Phone contacts for all staff connected to sites and ROCs are available within the GOCDB and VO contacts are available from the Operations Portal. Please see https://goc.gridops.org/user/security_contacts .

NB: Since a recent revision of the GOCDB it is now possible to download these data on a regular basis through GOCDBPI. It is advised that the operations teams download and update this list once a month.

4 C-COD

C-COD (Central COD) is a small team provided at a global layer, for the coordination of all RODs and the overall coordination of ROD services in the long term. One member of the team is on duty for a week at a time as the C-COD leader.

The C-COD team is composed of:

  • ROD representatives (from four volunteer ROCs)
  • The head of the C-COD team
  • Observers (possibly one person from each ROD)

Steps on how to become a C-COD member are described in section 3.1.1. Duties of the C-COD leader are described in section 4.2.

4.1 Dashboard

The operations dashboard for C-COD is available at https://operations-portal.in2p3.fr/ and is accessed via the "C-COD view" tab.

A HOWTO guide for the operations Regional Dashboard and an overview presentation can be found at https://edms.cern.ch/document/1015741 .

4.1.1 Monitoring

There are currently a variety of monitoring tools in EGEE/LCG which are used to detect problems with sites. They also provide useful information about sites. The operations dashboard https://operations-portal.in2p3.fr/ provides links and utilises combined views of the most common monitoring tools for performing tasks. The links below are to some of the tools used by operators.

There is a video available describing the transition from old style SAM to the Nagios based system.

Section 5 provides some general information about the monitoring tools and tests.

4.2 Duties

All duties listed in this section are mandatory for C-COD.

4.2.1 Acting on tickets escalated to C-COD

C-COD duties are meant to be kept minimal, so only a limited number of cases should be escalated to the C-COD layer. The following cases are automatically transferred to C-COD:

  • Alarms older than 3 days without an assigned ticket
  • Tickets which have expired 3 days ago
  • Tickets which have not been solved for 30 days
  • Tickets transferred to C-COD (last escalation step)
  • Sites in downtime for more than 1 month

In each case the C-COD leader should contact the related ROD and try to solve the problem. If the problem cannot be solved at the regional level then the case should be transferred to the next Weekly Operations Meeting.

NOTE: C-COD do not handle general tickets and alarms. They only contact the relevant ROD on expired cases.
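The criteria listed above can be sketched as a few small checks. This is only an illustration and not the dashboard's actual logic: the field and function names are hypothetical, the thresholds are assumed to be counted in calendar days, and "1 month" is approximated as 30 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Ticket:
    """Hypothetical ticket record; the field names are assumptions."""
    opened: date                      # day the ticket was created
    expiry: date                      # day the ticket expired or will expire
    solved: bool = False
    escalated_to_ccod: bool = False   # last escalation step already taken


def alarm_needs_ccod(raised: date, has_ticket: bool, today: date) -> bool:
    """Alarms older than 3 days without an assigned ticket."""
    return (not has_ticket) and (today - raised) > timedelta(days=3)


def ticket_needs_ccod(ticket: Ticket, today: date) -> bool:
    """Tickets expired for 3 days, unsolved for 30 days, or escalated."""
    if ticket.solved:
        return False
    if ticket.escalated_to_ccod:
        return True
    if today - ticket.expiry >= timedelta(days=3):
        return True
    return today - ticket.opened >= timedelta(days=30)


def downtime_needs_ccod(downtime_start: date, today: date) -> bool:
    """Sites in downtime for more than 1 month (assumed to be 30 days)."""
    return (today - downtime_start) > timedelta(days=30)
```

In each case a positive result only means that the C-COD leader should contact the related ROD; C-COD does not take over the alarm or ticket itself.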

4.2.2 Supervise RODs' work

One of the vital duties of C-COD is to supervise the work of all RODs and ensure the smooth functioning of regional teams by overseeing and coordinating their work. With the aid of global ROD metrics, C-COD can monitor the status of all RODs' work to detect any irregularity, and respond in such cases by creating GGUS tickets or by emailing the relevant ROC Support Unit directly.

C-COD metrics are available under the "View C-COD Metrics ..." link on the C-COD dashboard. 'Regional and global assessment of operations' reports based on C-COD metrics are created monthly, and can be found at https://edms.cern.ch/document/1020727.

4.2.3 Participate in Weekly Operations Meetings

The C-COD leader should be present as the C-COD representative at the Weekly Operations Meeting following their duty week. If necessary, they should raise all issues which could not be solved at the C-COD level. The C-COD leader should fill out the handover log well before the meeting; doing so also automatically sends the issues to be raised to the Weekly Operations Meeting participants. After the meeting, C-COD should send an outcome email back to the C-COD mailing list. Additionally, the C-COD leader should check the Weekly Operations Meeting actions list for actions assigned to C-COD and report progress if needed.

4.2.4 Deal with GGUS tickets assigned to the C-COD SU

C-COD as a representative of the whole ROD structure on the political level should deal with tickets assigned to C-COD in GGUS and also in actions at the Weekly Operations Meetings.

4.2.5 Handling global operational problems

In the scenario of a global operational service problem, in addition to any EGEE broadcast concerning that service, C-COD should advise RODs on how to handle alarms and/or tickets related to that service.

4.2.6 Removing problematic sites

The ROC must suspend one of their sites in either of the following two cases:

  • For problems that are not addressed by either the site or the ROC: C-COD will apply the “escalation step procedure” described in the Workflow and Escalation Procedure section. The final step is suspension, whereby the site is taken out of the grid resources. For a site to be suspended in this way, its ROC would have to leave several emails unanswered over a period of more than two weeks and not join the Weekly Operations Meeting when asked to. As soon as the site’s status is modified, the ROC should be sent another notification mail.

  • If a site is in downtime for more than one month: Before suspending the site, the ROC should investigate the situation to exclude the possibility that the site will be up again within a short period of time (i.e. within a few days).

Before a site is suspended, the site's ROC should make contact with the senior people in the federation, the site and C-COD. If a site later indicates its readiness to rejoin the infrastructure, it will need to go through the certification procedure again before its status is set to in production.

4.2.6.1 Emergency suspension

A site may be suspended directly in emergency cases, e.g. security incidents. The escalation procedure is then by-passed and either the ROC, ROD or the C-COD may set the site in suspension in agreement with the ROC's corresponding ROD. (In general, ROD can place a site in downtime if it is either requested by the site, or ROD sees an urgent need to put the site into downtime.)

Under exceptional circumstances, C-COD may suspend a site immediately without going through all the steps of the escalation procedure. The most important scenario for this is when a security incident occurs, and efforts by C-COD to contact the site's ROD or the site directly are unsuccessful. It is preferable in such cases that the site and ROD are contactable, so that the site can close down its resources and ROD can issue an immediate downtime. (A similar scenario is applicable for ROD where ROD is not able to contact the site in question.)

In all situations it is important that communication channels between all parties involved are active, and that C-COD (or ROD) inform the ROC and its site that the suspension has occurred.

4.2.7 Workflow and escalation procedure

This section introduces a critical part of operations: detecting, identifying and solving site problems. The escalation procedure is a procedure that operators must follow whenever any problem related to a site is detected. The main goal of the procedure is to track the problem follow-up process as a whole and keep the process consistent from the time of detection until the time when the ultimate solution is reached.

Moreover, the procedure is supposed to introduce a hierarchical structure and responsibility distribution in problem solving which should lead to significant improvement in the quality of the production grid service. Consequently, minimizing the delay between the steps of the procedure is of utmost importance. The regular procedure the operators follow can be considered in four phases.

  • submitting problems into the problem tracking tool after they are detected using monitoring tools or by a task created by a regional operations team (ROC);
  • updating the task when a site state changes which can be detected either by a comparison of the monitoring information with the current state of the task in the problem tracking tool, or by input from a ROC;
  • closing tickets or escalating outdated tickets when deadlines are reached in the problem tracking tool;
  • initiating the last escalation step and/or communication with site administrators and ROCs.

Below are the detailed steps of the “escalation procedure”, which apply if no response is received to the notification of a problem or the problem has been left unattended for the stated duration:

Step [#] | Max. Duration [work days] | Escalation procedure
1 | 3 | When an alarm appears on the ROD dashboard (>24 hours old): 1st mail to site admin and ROC.
2 | 3 | 2nd mail to site admin and ROC. At the end of this period escalate to C-COD.
3 | 5 | Ticket escalated to C-COD. Within that week, C-COD should act on the ticket by sending an email to the ROC, ROD and site, asking for immediate action and stating that representation at the next Weekly Operations Meeting is requested. The discussion may also include site suspension.
4 | - | (If no response is obtained from either the site or the ROC) C-COD will discuss the ticket at the FIRST Weekly Operations Meeting and involve the Operations and Co-ordination Centre (OCC) in the ticket.
5 | 5 | Discuss at the SECOND Weekly Operations Meeting and assign the ticket to OCC.
6 | - | Where applicable, C-COD will request OCC to approve site suspension.
7 | - | C-COD will ask the ROC to suspend the site.

NB: Theoretically, the whole process could be covered in a 2-3 week period. Most often a site is either suspended on the spot for security reasons, or the problem is solved (or the site and the ROC react) well before operators need to escalate the issue to C-COD, who then determines whether to bring it to the Weekly Operations Meetings.

NB: After the first 3 days, at the 2nd escalation step, if the site has not solved its problem, ROD should suggest to the site to declare downtime until they solve the problem and the ROC should be notified. If they do not accept the downtime then C-COD will proceed with the regular escalation procedure at the agreed deadlines.
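As an illustration of how the maximum durations in the escalation steps add up, the following sketch computes the latest date for each step from the day the alarm is first acted upon. It is not part of any operations tool. It assumes that work days are Monday to Friday, ignores public holidays, and treats the steps without a stated duration (4, 6 and 7) as adding no time; the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# (step number, maximum duration in work days); None where no duration is stated.
ESCALATION_STEPS: List[Tuple[int, Optional[int]]] = [
    (1, 3), (2, 3), (3, 5), (4, None), (5, 5), (6, None), (7, None),
]


def add_work_days(start: date, days: int) -> date:
    """Return the date `days` work days after `start`, skipping weekends."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def escalation_deadlines(start: date) -> List[Tuple[int, date]]:
    """Latest completion date of each escalation step, starting at `start`."""
    deadlines = []
    current = start
    for step, max_days in ESCALATION_STEPS:
        if max_days is not None:
            current = add_work_days(current, max_days)
        deadlines.append((step, current))
    return deadlines


if __name__ == "__main__":
    for step, deadline in escalation_deadlines(date(2010, 3, 1)):
        print(f"step {step}: latest {deadline.isoformat()}")
```

In practice the dates of steps 4 and 5 are tied to the Weekly Operations Meeting schedule, which this sketch does not model.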

4.2.8 Collate knowledge sharing contributions

One of the activities of C-COD is to promote knowledge sharing on all operational matters. C-COD should collect and propagate the experience between site administrators and supporters at the project level. C-COD should review the contributions and ensure convergence of regional practices.

To achieve this goal, ROD and site admins should make contributions to the Best Current Practices twiki.

4.2.9 Upgrade OPM

The C-COD team is responsible for upgrading the Operations Procedure Manual in conjunction with the ROD community. All changes should be made according to the procedure described in section 1.1.

4.2.10 Make recommendations on criticality of tests

The C-COD team is the unit which makes recommendations on the criticality of monitoring tests on behalf of the ROD community.

4.3 Handover log model

The C-COD leader is obliged to send a summary (handover) via the HANDOVER tool on the dashboard at the end of the week to the next C-COD leader on duty. This also contains the information needed for the weekly WLCG meeting.

An example handover template:

Handover from (old C-COD leader) to (new C-COD leader)

Issues raised during the week:

   * ROD: number of alarms/tickets not handled by the ROD

Issues pending:

ROD:
   * site: reason why site appears on the C-COD dashboard and last status

Other issues:
   * Did you encounter any tickets that changed 'character' ? (i.e. no longer a simple incident that can easily be fixed, but rather a problem that may result in a Savannah bug) -- means that the use-cases wiki has to be updated.
   * Any alarms that could not be assigned to a ticket (or masked by another alarm)?
   * Any tickets opened that are not related to a particular alarm
   * Anything else the new leader should know?
   * Instructions received from recent Weekly Operations Meeting (only for the leader taking over)

Weekly Operations Meeting:
   * List unresponsive sites: note name of Site and ROC, as well as GGUS Ticket number and reason for escalation
   * Report any problems with operational tools during shift
   * Did you encounter any issues with the  C-COD procedures, Operational Manual?
   * Report encountered problems with grid core services
   * Any Savannah/GGUS tickets that need more attention to a wider audience?

After the Weekly Operations Meeting the C-COD leader should send a summary to the C-COD mailing list with relevant information for the next leader on shift. Again, this should be done via the HANDOVER tool on the operations dashboard.

4.4 Communication Lines

For internal communications, C-COD uses the project-eu-egee-sa1-c-cod-followup@cern.ch mailing list, specially created for this purpose. This mailing list is also the place for other units to contact C-COD directly.

C-COD and other RODs can communicate with RODs in two ways: through the ROD representative in C-COD, or via direct email to the ROD mailing list (project-eu-egee-sa1-cic-on-duty@cern.ch).

Contact list of individual RODs:

There is also the handover section in the ROD dashboard which allows for C-COD and RODs to intercommunicate.

In the case of a non-responsive ROD, C-COD should create a GGUS ticket for the appropriate ROC Support Unit.

C-COD communicates with other units like OCC, Weekly Operations Meeting, developers, ROC Managers, etc. mainly by emails and GGUS tickets.

5 Monitoring and problem tracking tools

There are currently a variety of monitoring tools in EGEE/LCG which are used to detect problems with sites. They also provide useful information about sites. The operations dashboard provides links and utilises combined views of the most common monitoring tools for performing tasks. The links below are for direct access to some of the monitoring tools used in the operations dashboard.

5.1 Monitoring

The links below are for direct access to some of the monitoring tools.

While the above tools are used on a daily basis there are many more monitoring systems available. For a current overview the GOC page on monitoring can be consulted:

5.1.1 NAGIOS Availability Monitoring (MyEGEE)

One of the most detailed sources of information about possible problems is the MyEGEE Portal, https://sam-%INSERT_YOUR_ROC%-roc.cern.ch/myegee (1). Its front page offers two options, Resource Summary and Status History, and has a link to the portal's help pages on the right-hand side. The more useful option for operations is Status History.

Status History:

In MyEGEE you can view the status of a site and service according to different profiles by selecting the relevant profile in the 'Profile Selection' drop-down menu. For operational purposes, the ROC_OPERATORS profile should be used. By careful manipulation of the parameters on the right side of the Status History page, you can select the sites and resources of interest to you for viewing. It will be useful to familiarise oneself with this tool by reading the relevant help pages at https://sam-%INSERT_YOUR_ROC%-roc.cern.ch/myegee/help.html .

Two options are possible for further analysis:

  • clicking on the history bar at or near a point of failure,
  • selecting the table tab.

(Note, it is useful to select a shorter date range at this point, or select the Current SAM Status (2) option.) Any failure observed here should be recognised, understood, diagnosed and the follow-up process should be started. This is especially important for failures of the critical tests, which should be treated with the highest priority.

The two options mentioned above provide an abbreviated status of the test results from the probes. Detailed information about each test result is available by selecting the Show Detail widget of that result. The detailed reports should be studied to diagnose the problems.

(1) Please replace the whole string %INSERT_YOUR_ROC% with your ROC name in capitals, e.g. UKI or SWE.

(2) Please note that SAM in this context does not refer to the old SAM tests.
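The substitution described in note (1) can be sketched as follows. The two templates are the URLs given in this section; the helper name fill_roc is hypothetical, and the sketch only builds the string without checking that the resulting host exists.

```python
PLACEHOLDER = "%INSERT_YOUR_ROC%"
MYEGEE_TEMPLATE = "https://sam-%INSERT_YOUR_ROC%-roc.cern.ch/myegee"
MYEGEE_HELP_TEMPLATE = "https://sam-%INSERT_YOUR_ROC%-roc.cern.ch/myegee/help.html"


def fill_roc(template: str, roc_name: str) -> str:
    """Replace the whole placeholder string with the ROC name in capitals."""
    name = roc_name.strip().upper()
    if not name:
        raise ValueError("ROC name must not be empty")
    return template.replace(PLACEHOLDER, name)


if __name__ == "__main__":
    print(fill_roc(MYEGEE_TEMPLATE, "uki"))
    print(fill_roc(MYEGEE_HELP_TEMPLATE, "SWE"))
```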

5.1.1.1 ROC_OPERATORS profile

In the Nagios system there is a set of profiles corresponding to groups of probes instead of a single list of critical tests. A profile is therefore a mapping from a service type to a list of metrics (tests) e.g. for a CE there are about 10 metrics gathered.

A profile may be used for different functionality requirements, e.g. for raising notifications to operators (ROC_OPERATORS profile), or for the calculation of availability (ROC_CRITICAL profile - note that this will be renamed to ROC_AVAILABILITY). The list of tests in the ROC_OPERATORS profile for Nagios can be found here: https://twiki.cern.ch/twiki/bin/view/LCG/SAMCriticalTestsForCODs .

Over time more profiles will be created for specific tests, or tests which will be applied to specific sets of resources e.g. MPI tests, which should only run on resources which support MPI.
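The idea of a profile as a mapping from a service type to a list of metrics can be pictured with a small data structure. This is a conceptual sketch only: the metric names are invented placeholders, not the real contents of the ROC_OPERATORS or ROC_CRITICAL profiles, and the helper names are hypothetical.

```python
from typing import Dict, List

# A profile maps a service type (e.g. "CE") to the metrics gathered for it.
Profile = Dict[str, List[str]]

# Placeholder metric names, for illustration only.
PROFILES: Dict[str, Profile] = {
    "ROC_OPERATORS": {
        "CE": ["example.ce.job-submit", "example.ce.ca-cert"],
        "SE": ["example.se.put"],
    },
    "ROC_CRITICAL": {
        "CE": ["example.ce.job-submit"],
    },
}


def metrics_for(profile_name: str, service_type: str) -> List[str]:
    """Metrics that a given profile associates with a service type."""
    return PROFILES.get(profile_name, {}).get(service_type, [])


def profiles_containing(service_type: str, metric: str) -> List[str]:
    """Names of all profiles that include the metric for the service type."""
    return sorted(
        name
        for name, profile in PROFILES.items()
        if metric in profile.get(service_type, [])
    )
```

With this picture, the same metric can appear in several profiles, which is how one test can both raise notifications to operators and count towards availability.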

Currently the old SAM results are still used for site availability calculations. More information can be obtained on the (old) SAM wiki pages dedicated to the (old) SAM critical tests for Availability Metrics calculations: https://twiki.cern.ch/twiki/bin/view/LCG/EGEEWLCGCriticalProbes

5.1.1.1.1 Procedure to add a test to the ROC_OPERATORS profile

Tests can be added to the ROC_OPERATORS profile so that they generate an alarm in the Operations Dashboard when they return an error status.

This change to the ROC_OPERATORS profile must be agreed upon by various management boards, and the procedures outlined below to define a test as critical for the OPS VO must be scrupulously followed.

Please note that in the scope of this document only OPS critical tests are considered.

"Candidate critical test" requests, i.e. requests for new tests which will raise alarms on the dashboard upon failure and which operators on shift should handle, are reviewed during the quarterly ROD Forum meetings. Urgent requests can come from the ROCs directly.

In face-to-face meetings, regular phone meetings or by mail, Operations Procedures (aka 'Pole 2') members (currently project-eu-egee-sa1-cod-pole2-proc@cern.ch) will:

  • gather input from people attending the ROD Forum meetings and make recommendations to the Nagios team,
  • check that all the relevant activities and tools are ready to handle the new critical test.

This can be summarised in the following check-list:

  1. Gather ROD feedback on the test implementation;
  2. Gather ROD feedback on the impact of the general activity;
  3. Record feedback and final recommendations in a GGUS ticket referenced in the wiki;
  4. Check that the template mail in the dashboard accounts for the new critical test;
  5. Check availability of appropriate links in the dashboard for the new critical test;
  6. Check that appropriate instructions are defined in the Operations Procedure Manual.

Once ROC managers validate the new test to become critical for the OPS VO, it is announced to all sites/ROCs/C-CODs via a BROADCAST during week W-2, and announced at the Weekly Operations Meeting one week later (W-1). Provided the tests have been approved on the basis of all feedback gathered in the process, they become critical in week W, at the time of the Weekly Operations Meeting.

The above procedure is summarized in the following list:

  1. A request for a new test comes from VO/ROC/Operations
  2. The Nagios team evaluate and integrate the test. ROC managers are also officially informed of the new test.
  3. The test is put into candidate status, so that it is available on the dashboard as a non-critical test, but still with final "Error", "Warn" or "Ok" flags.
  4. A report is submitted by VO/ROC/Operations to ROC managers on the progress of the test. This should include details of sites failing the test (by region).
  5. ROCs review results with their sites until 75% of sites pass the new tests.
  6. A standing item on the ROC managers meeting agenda is made to accept, reject or request more time until the test is validated. A target of two weeks should be put on this step.
  7. Once the ROC managers accept the test, the acceptance is noted at the Weekly Operations Meeting. Wait one week.
  8. A broadcast is sent to all sites one week before the test is set as critical. This is also mentioned at the Weekly Operations Meeting.
  9. A Broadcast is sent to all sites one day before the test is set as critical.
  10. Test is set as critical.
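Two of the steps in the list above involve simple arithmetic: the 75% pass rate in step 5 and the broadcast dates in steps 8 and 9. The sketch below illustrates both. It assumes that exactly 75% already counts as reaching the target, and the function names and the shape of the input data are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

PASS_THRESHOLD = 0.75  # step 5: 75% of sites pass the new test


def pass_fraction(results: Dict[str, bool]) -> float:
    """Fraction of sites passing; `results` maps site name to pass/fail."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def ready_for_validation(results: Dict[str, bool]) -> bool:
    """True once the pass rate reaches the threshold (assumed inclusive)."""
    return pass_fraction(results) >= PASS_THRESHOLD


def broadcast_dates(critical_day: date) -> Tuple[date, date]:
    """Dates of the broadcasts sent one week and one day before the test
    is set as critical (steps 8 and 9)."""
    return critical_day - timedelta(weeks=1), critical_day - timedelta(days=1)
```

A per-region report, as requested in step 4, could apply pass_fraction to the sites of each region separately.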

5.1.1.2 Alarm generation in the Operations Dashboard

Alarm generation in the operations dashboard relies on checks of service notifications. The ROC_OPERATORS profile (see the section on ROC_OPERATORS profile above) defines the services that are checked. The decision to send out notifications is made in the service check and host check logic. Host and service notifications occur in the following instances:

  • When a hard state change occurs
  • When a host or service remains in a hard non-OK state and the time specified by the notification_interval option in the host or service definition has passed since the last notification was sent.
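
The two conditions above can be written as a small decision function. This is a minimal sketch of the rule as stated here, not Nagios code: the `CheckState` class and its field names are hypothetical, the interval is taken as plain seconds, and the further filters that Nagios applies (for example downtime and notification periods) are not modelled.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    """Hypothetical snapshot of a host or service check."""
    is_hard: bool                           # state type is "hard"
    is_ok: bool                             # current state is OK
    hard_state_changed: bool                # a hard state change just occurred
    seconds_since_last_notification: float


def should_notify(state: CheckState, notification_interval: float) -> bool:
    """Apply the two notification conditions listed above."""
    # Condition 1: a hard state change has occurred.
    if state.hard_state_changed:
        return True
    # Condition 2: still in a hard non-OK state and the interval has elapsed.
    still_failing = state.is_hard and not state.is_ok
    return still_failing and (
        state.seconds_since_last_notification >= notification_interval
    )
```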

Information on state types and hard state changes can be found at the following URLs:

The actual alarm creation is made through a filter: if the service state changes, the filter registers the change, which is then stored in a database and displayed on the dashboard.

5.1.2 GIIS Monitor

There is a new GIIS monitor at CERN: the GStat 2.0 Monitor. It is used to display information about grid services, the grid information system itself and related metrics.

GStat 2.0 does not rely on any submitted job, but rather scans all site GIISes/BDIIs to gather the information and perform sanity checks to point out any potential problems with the information system of individual sites. The tool covers the following areas:

  • Static information published by a site: site name, version number, contact details, runtime environment (installed software);
  • Logical information about a site’s components: CE, number of processors, number of jobs, all SEs, summary storage space, etc.;
  • Information integrity: bindings between the CE and close SEs;
  • Geographical information;
  • LDAP query view;
  • Status of the top-level BDIIs;
  • Useful statistics.
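
As an illustration of the kind of information integrity check mentioned in the list above, the sketch below verifies that every close SE bound to a CE is also published by the site. The data layout and the function name are assumptions made for this example. It does not query any BDII and does not reproduce the actual checks implemented in GStat 2.0.

```python
def check_close_se_bindings(
    ce_bindings: dict[str, list[str]],
    published_ses: set[str],
) -> list[str]:
    """Return a list of human-readable problems found in the CE/SE bindings.

    `ce_bindings` maps each CE host name to the close SEs it advertises;
    `published_ses` is the set of SE host names published by the site.
    Both structures are hypothetical stand-ins for information system data.
    """
    problems = []
    for ce, close_ses in sorted(ce_bindings.items()):
        if not close_ses:
            problems.append(f"{ce}: no close SE published")
        for se in close_ses:
            if se not in published_ses:
                problems.append(f"{ce}: close SE {se} is not published by the site")
    return problems
```

An empty result means that no inconsistency was found in the supplied data.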

GStat 2.0 provides many different views of the grid. The Site View is similar to that provided by the old GSTAT tool. Filters can be applied to select the sites of interest by Country, EGEE_ROC, Grid, VO or WLCG_Tier. A table shows summary information about the sites, including their status and statistics for numbers of CPUs, storage and jobs. Each site has an individual page with more detailed information about the results of the tests and sanity checks that have been performed, together with several plots of historical values of critical parameters. These historical plots are useful for spotting intermittent problems with a site's information system.

Further information about the architecture of the new system.

5.2 Problem tracking

5.2.1 GGUS

In order to keep track of the follow-up process, all operators have to submit each detected problem to a problem tracking tool. The current problem tracking tool is the Global Grid User Support (GGUS) tool, based on Remedy, and run by FZK. It has been in use for grid operations since mid-April 2005. The GGUS ticketing system has two available interfaces:

  • The operations dashboard, which should only be used by ROD to report new problems and assign them to ROCs;
  • The generic GGUS interface, which should be used by ROCs once tickets are assigned to them. This may be via the local helpdesk if there is a local GGUS interface.

5.2.1.1 Creating tickets

In EGEE/LCG operations, a single problem at an individual site, together with its associated follow-up process, is represented by a single ticket in GGUS. In order to organise and categorise the tickets, the following structure has been put in place:

  • Ticket: each ticket represents a problem at a single site;
  • Category: identifies a site by the site name;
  • Priority: used internally to mark sites with higher operational importance according to the number of provided resources (big sites - high priority);
  • Item group: represents a problem type, e.g. LCG Version.

The task fields which are introduced above describe the individual problems in terms of location (site), type, and importance.

Task field usage: apart from the information provided by the fields which were introduced above, there are a number of fields which describe the details of the problem and current status in terms of escalation procedure. These fields should be utilised as follows:

  • Should be Finished on - the deadline for the current escalation step;
  • Assigned to - the ROC currently responsible for the particular problem;
  • Last action taken - last action which was taken according to the escalation procedure;
  • Person contacted - the name, email address and possibly phone number of the person who was contacted in the last action;
  • Response - a summary of communication with the person responsible for the site in the last action;
  • Summary - a summary of the problem, it is highly recommended to put the affected host’s name in the first part of the summary;
  • Original submission - the original error message plus any comments that may be useful for problem identification and solving.

NB: Please note that the ticket should not expire during the weekend.
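
Two of the recommendations above lend themselves to a short sketch: putting the affected host's name first in the summary, and choosing a deadline that does not fall on a weekend. Moving a weekend deadline forward to the following Monday is one possible reading of the note above, not a documented GGUS rule, and both function names are hypothetical.

```python
from datetime import date, timedelta


def make_summary(host: str, problem: str) -> str:
    """Build a ticket summary with the affected host's name first."""
    return f"{host}: {problem}"


def avoid_weekend(deadline: date) -> date:
    """Move a deadline that falls on a weekend to the following Monday.

    Assumption: "should not expire during the weekend" is handled by
    extending the deadline rather than shortening it.
    """
    # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    while deadline.weekday() >= 5:
        deadline += timedelta(days=1)
    return deadline
```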

6 Security Items and Daily Operations

6.1 Security matters

Specific security issues or questions can be sent to project-egee-security-support@cern.ch. This list comprises the generic security contacts in each ROC, who will provide a reply. As it is rather difficult to ensure that GGUS tickets are readable only by the affected parties, one should refrain from using GGUS to track operational security issues.

All sites are bound by the JSPG policies listed at http://cern.ch/osct/policies.html. The policies cover many areas, including:

  • "You shall immediately report any known or suspected security breach or misuse of the GRID or GRID credentials to the incident reporting locations specified by the VO and to the relevant credential issuing authorities. The Resource Providers, the VOs and the GRID operators are entitled to regulate and terminate access for administrative, operational and security purposes and you shall immediately comply with their instructions." (Grid Acceptable Use Policy, https://edms.cern.ch/document/428036)

  • "Sites accept the duty to co-operate with Grid Security Operations and others in investigating and resolving security incidents, and to take responsible action as necessary to safeguard resources during an incident in accordance with the Grid Security Incident Response Policy." (Grid Security Policy, https://edms.cern.ch/document/428008/)

  • "You shall comply with the Grid incident response procedures and respond promptly to requests from Grid Security Operations. You shall inform users in cases where their access rights have changed." (Virtual Organisation Operations Policy, https://edms.cern.ch/document/853968/)

6.2 OSCT organization and OSCT-Duty Contact Role

The Operational Security Coordination Team (OSCT, http://cern.ch/osct) provides an operational response to security threats against the EGEE infrastructure. It focuses mainly on computer security incident handling, by providing reporting channels, pan-regional coordination and support. It also deals with security monitoring on the Grid and provides best practices and advice to Grid system administrators.

The OSCT is led by the EGEE Security Officer and includes security contacts from each EGEE region. The OSCT Duty Contact (OSCT-DC) role is assigned on a weekly basis to one of the ROC Security Contact persons who provide support for daily security operations, according to a schedule defined by the OSCT.

The OSCT-DC must perform the following actions:

1. All reported incidents should be coordinated by the ROC Security Contact of the site reporting the incident (although this responsibility MAY be delegated within the region). The OSCT-DC must nevertheless ensure that timely coordination of the incident actually happens, which includes taking over the coordination personally within an appropriate time frame if needed. More details about the role of the incident coordinator are available in the EGEE incident response procedure.

2. Ensure that appropriate action is taken in a timely manner (often by the affected ROC) to solve tickets assigned to the GGUS Security Management unit, or any matter raised on the team's mailing list.

3. If necessary, attend the weekly OPS meeting and report or follow up as appropriate within the OSCT.

4. Send a report to the OSCT list, before lunch time on the Monday following the duty week, containing a summary of the issues of the week and those that are carried over from a previous week. This handover should also be covered in the weekly security operations phone meeting, so that the current and next OSCT-DC can discuss open issues.

5. Write new (or update existing) 'best-practice' items for the security RSS feed, based on advisories from GSVG (items should be sent to project-egee-security-officer@cern.ch to be published). If an issue posted by GSVG requires urgent action, raise it at the weekly OPS meeting and send a specific EGEE broadcast after consulting the rest of the OSCT.

6. Monitor CA updates and monitor the update process as described in

A backup OSCT-DC is defined in the same manner. This role is expected to cover situations such as a prolonged unexpected network outage affecting the Lead site.

6.3 Security incidents handling and Interaction with OSCT-DC

OSCT Duty Contacts (OSCT-DC) are members of the OSCT group who provide security coordination in a given week, in tandem with the C-COD team. The OSCT-DC has to be ready to respond to tickets created on security matters during that week.

If a security incident is suspected, the EGEE incident response procedure must be used. It is available and maintained at http://cern.ch/osct/incident-reporting.html, which is an excerpt from the full procedure at https://edms.cern.ch/file/867454/2/EGEE_Incident_Response_Procedure.pdf.

When a security related ticket is created by a site, VO or user, the OSCT-DC should assign the ticket to the Security Support unit. They then follow up the issue according to the OSCT regulations summarized above, taking into consideration the restricted information and contact details they have access to.

Security incidents are handled via the communication channels described in the procedure and must not be discussed via GGUS. If a security incident is initially discovered via GGUS, communication with the affected parties must be switched to the appropriate channels, and the existing ticket should be closed and only used later as a reference.

However, 1st Line Support may have to act rapidly according to OSCT-DC instructions, depending on the severity of the incident. 1st Line Support should also inform ROD about the situation. ROD can then act as a follow-up contact.

6.4 Grid Security Vulnerability Handling

Reason for a Vulnerability handling process

A lot of care is taken to ensure that Grid software and its deployment are secure. However, from time to time, Grid Security Vulnerabilities are found. Grid Security Vulnerabilities are problems in the software or deployment which may be exploited in order to create an incident. Such problems need to be resolved in a timely manner in order to prevent incidents and, while this is happening, it is important that they are not advertised to potential hackers. Hence, within EGEE, a process for reporting and handling vulnerabilities has been established, which is run by the EGEE Grid Security Vulnerability Group (GSVG).

Reporting a Vulnerability

If a possible vulnerability is found, an e-mail should be sent to

grid-vulnerability-report@cern.ch

Alternatively, the issue may be entered as a "bug" in the GSVG savannah at https://savannah.cern.ch/projects/grid-vul/. For obvious reasons, these bugs are set to "private" so that only a limited number of people can browse them.

Note that if a vulnerability has been exploited, it is an incident and the incident handling procedure should be followed.

Handling of issues reported

The GSVG will investigate the issue and inform the reporter of the findings. If the issue is found to be valid, the Risk Assessment Team will place the issue in one of 4 Risk categories, each of which has a corresponding Target Date.

Risk Categories

  • Extremely Critical - Target Date = 2 days
  • High - Target Date = 3 weeks
  • Moderate - Target Date = 3 months
  • Low - Target Date = 6 months
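
The mapping from Risk category to Target Date can be sketched as below. The sketch assumes that the Target Date is counted in calendar time from the day of the risk assessment, which the text above does not state explicitly. The function names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping to the last day of a shorter month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def target_date(risk: str, assessed_on: date) -> date:
    """Return the Target Date for a risk category.

    Assumption: the period starts on the day the risk is assessed.
    """
    if risk == "Extremely Critical":
        return assessed_on + timedelta(days=2)
    if risk == "High":
        return assessed_on + timedelta(weeks=3)
    if risk == "Moderate":
        return add_months(assessed_on, 3)
    if risk == "Low":
        return add_months(assessed_on, 6)
    raise ValueError(f"unknown risk category: {risk!r}")
```

For an issue assessed as "High" on 1 March 2010, this gives a Target Date of 22 March 2010.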

It is then the responsibility of the appropriate development team to produce a patch. This issue will be kept private until either a patch is issued to resolve it, or the target date is reached.

On resolution or reaching the Target Date

An advisory will be released. Hopefully, in most cases a patch will be issued in time for the Target Date. Advisories are now placed at http://www.gridpp.ac.uk/gsvg/advisories/. Advisories should reference the release, and release notes reference the advisory.

Further Details

Details of the issue handling process are available in the document entitled 'The Grid Security Vulnerability Group - Process and Risk Assessments for Specific Issues' at https://edms.cern.ch/document/977396/1 .

More information on the Grid Security Vulnerability Group is available at http://www.gridpp.ac.uk/gsvg/.

Other notes

If the issue is found to be operational, rather than being due to a software bug, then an advisory will be sent to the OSCT. Please refrain from discussing vulnerabilities on open mailing lists or logged mailing lists, and from reporting them as 'bugs' in open bug reporting systems.

The GSVG was set up primarily to handle vulnerabilities and improve the security in EGEE gLite Middleware, and does not handle vulnerabilities in operating systems or in non-Grid software. However, if such issues are reported to the GSVG, they will attempt to pass the information on to the relevant party.

7 References

[1] CIC Portal
[2] Operations Dashboard
[3] GOCDB
[4] Global Grid User Support
[5] GIIS Monitoring pages
[6] Availability report page
[7] SAM Nagios Documentation page
[8] Generic Nagios Documentation (CGI)
[10] EGEE broadcast tool
[11] Dashboard HOWTO Guides
[12] Best Practices
[13] Intervention Procedures & Notification Mechanism
[14] Site/ROC Association document
