1 Operational Procedures Common Sections

1.1 Preface

The EGEE Operational Procedures Manual (OPS Manual) defines the procedures and responsibilities of the various parties involved in running the EGEE infrastructure, namely: the resource centres (also referred to as 'sites'), consisting of local support and site administrators; the staff of the Regional Operations Centres (ROCs), such as the ROC Manager and the ROC support staff; the regional operations team, consisting of the Regional Operator on Duty (ROD) and the 1st Line Support; and the oversight grid monitoring operators (also referred to as 'C-COD'). The OPS Manual is currently structured in two separate documents, one detailing the procedures and responsibilities of the regional operations team and its ROC, the other detailing those of the C-COD team members. To avoid spreading the same information across multiple documents, and to allow for easier updates, it was decided to have a third separate document containing the sections that are relevant to both manuals, the OPS 'common sections' manual. Sections of this document will be included in the relevant sections of the two other manuals using the twiki INCLUDE mechanism.

The OPS Manual for the regional operations related to ROCs and sites can be found at the following link:

https://twiki.cern.ch/twiki/bin/view/Trash/EGEEOperationalProceduresforROCsAndSitesObsolete09042010

The OPS Manual for the regional operations team can be found in the following link:

https://twiki.cern.ch/twiki/bin/view/Trash/EGEEOperationalProceduresforROCsAndSitesObsolete09042010

https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalProceduresforRegionalCOD

https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalProceduresforRegionalOD

The OPS Manual for C-COD can be found in the following link:

https://twiki.cern.ch/twiki/bin/view/Trash/EGEEDRAFTOperationalProceduresforCOD

https://twiki.cern.ch/twiki/bin/view/Trash/EGEEOperationalProceduresforCCODObsolete09042010

The above procedures can also be found in EDMS at: https://edms.cern.ch/document/840932

The above procedures can also be found in EDMS at: https://edms.cern.ch/document/840???

Readers of one of the manuals above are encouraged to also read the other, in order to have a clear picture of daily operations within EGEE.

Please verify that the document you are using is the current release. The document can be found at:

This document does not describe future procedures. It describes the tools and procedures used currently to operate the EGEE production service.

NB: For the purposes of this version of the Operations Procedures, the team formerly known as the regional Operator on Duty will be called ROD and the central CIC-on-duty will be called C-COD. The old COD operations may sometimes be referred to in this document, in which case they will be known as OLD_COD. When referring to roles and meetings, the references will still be COD. The website known as the CIC portal will be called the Operations portal, and the regional dashboard is simply referred to as the dashboard, with some limitations according to roles.

Since there is not yet sufficient experience with the operation of large-scale production grids, we expect this document to change significantly and frequently.

  • Change requests come from users of the Operational Procedures Manual, namely: site managers, ROD, C-COD and the SA1 Operational Teams.
  • To request a change in the manual any interested party should open a ticket via GGUS specifying the "Type of problem" as: "ROD and C-COD Operations";
  • According to the type of change requested the following procedure is followed:
    • Significant changes in procedures and subsequent updates are discussed at C-COD/ARM meetings, which occur quarterly. These requests will be dealt with according to their priority and level of effort required to make the changes.
    • For urgent changes, proposals have to be discussed on the C-COD and ROC-managers mailing lists and agreed at the following Weekly Operations Meeting by a large majority of operational people.
    • When agreed and validated these changes are implemented and the procedure coordinator will release a new version of the document.
    • New versions of the Operations Manual that contain changes in the procedures will be BROADCAST to all sites and ROCs via the broadcasting tool.
    • Small typographical and grammatical changes can be made without discussion or approval and published at short notice. No BROADCAST is sent for these types of changes.

1.1 Preface

Staff within SA1, responsible for the daily operations of the EGEE grid, are divided into the following areas:

  • Oversight Team - C-COD (consisting of ROD representatives) and support tools developers;
  • Regional Operations Team - ROD and 1st Line Support;
  • Regional Operations Centre - ROC Managers and ROC support staff;
  • Resource Centres (sites) - local support and site administrators.

The Regional Operations team is responsible for detecting problems, coordinating the diagnosis, and monitoring the problems through to a resolution. This has to be done in cooperation with the Regional Operations Centres to allow for a hierarchical approach and overall management of tasks.

Procedures have to be followed according to their formal descriptions to ensure a predictable workflow and to avoid both duplication of effort and problems being left without any action.

0.1 GOC Database

The GOC Database (GOCDB) is a core service within EGEE and is used by many tools for monitoring and accounting purposes. It also contains essential static information about the sites such as:

  • site name;
  • location (region/country);
  • list of responsible people and contact details (site administrators, security managers);
  • list of all services running on the nodes (CE, SE, UI, BDII, etc.);
  • phone numbers.

Site administrators have to enter all scheduled downtimes into the GOCDB. The information provided by the GOCDB is an important source of information during problem follow-up and escalation procedures.
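
As an illustration of how operational tools can consume this information, the short sketch below retrieves the scheduled downtimes of a site through the GOCDB programmatic interface. It is a minimal sketch only: the endpoint URL, the 'get_downtime' method and the 'topentity' parameter are assumptions based on typical GOCDB deployments, and the site name is a hypothetical placeholder; check them against the GOCDB documentation before use.

    # Minimal sketch (Python): list scheduled downtimes for one site via the
    # GOCDB programmatic interface.  Endpoint, method name and parameter are
    # assumptions; the site name is a hypothetical placeholder.
    import urllib.request
    import xml.etree.ElementTree as ET

    GOCDB_PI = "https://goc.gridops.org/gocdbpi/public/"  # assumed endpoint
    SITE = "EXAMPLE-SITE"                                  # hypothetical site name

    url = GOCDB_PI + "?method=get_downtime&topentity=" + SITE
    with urllib.request.urlopen(url) as response:
        root = ET.fromstring(response.read())

    # Field names vary between GOCDB versions, so print each downtime record
    # generically instead of hard-coding element names.
    for downtime in root:
        print("---")
        for field in downtime:
            print(field.tag + ": " + str(field.text))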

3 Getting Started

All new operations team members have to take certain steps to ensure that they can access the dashboard.

The necessary steps for each role will be repeated in the following sub-sections.

There are a few recommendations before starting the actual task. The new operator needs to become familiar with the tools and procedures listed below.

They should try to understand the errors that appear and identify any that they would not be able to follow up on while on duty. They can ask the C-COD mailing list or refer to the GOC Wiki pages for help. Typically, they should also read some of the emails which have been sent to the follow-up list.

  • Go to the operations dashboard manual and work through the functionality of the dashboard;
  • Go to the FAQ pages and read the FAQs. The goal here is not to know the material by heart, but to get an idea of the symptoms that they will be confronted with;
  • Take a look at the ROD handover logs and the previous minutes of the Weekly Operations Meetings;
  • Read the gLite User Guide and try submitting a "Hello World" job (see the sketch after this list). This will give them valuable background information, and with some hands-on experience they can tell user errors from grid service errors more easily;
  • Ask questions. Find people who have done it before or, even better, join an experienced team for a few days and look over their shoulders while they do their work. Not everything can be written down.
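
For the "Hello World" exercise mentioned in the list above, the sketch below shows one possible JDL file and submission sequence. It assumes a gLite UI with the WMS command-line tools installed and a valid VOMS proxy; the JDL attributes and command options follow the gLite User Guide but should be verified against the version installed at your site.

    # Minimal sketch (Python): write a "Hello World" JDL file and drive the
    # gLite WMS command-line tools.  Assumes a gLite UI and a valid VOMS proxy.
    import subprocess

    JDL = (
        'Executable    = "/bin/echo";\n'
        'Arguments     = "Hello World";\n'
        'StdOutput     = "hello.out";\n'
        'StdError      = "hello.err";\n'
        'OutputSandbox = {"hello.out", "hello.err"};\n'
    )

    with open("hello.jdl", "w") as f:
        f.write(JDL)

    # Submit with automatic proxy delegation (-a) and keep the job ID in a file.
    subprocess.run(["glite-wms-job-submit", "-a", "-o", "jobid.txt", "hello.jdl"], check=True)

    # Later: poll the job status and, once it is done, retrieve the output sandbox.
    subprocess.run(["glite-wms-job-status", "-i", "jobid.txt"], check=True)
    subprocess.run(["glite-wms-job-output", "-i", "jobid.txt"], check=True)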

3.1 Modes of Operations

ROD and 1st Line Support are collectively known as Regional Support, and are organised internally in each ROC. They can be run as two separate teams or be one team that covers both sets of duties. The responsibility of the 1st Line Support team may differ in each ROC. The ROC determines how the work of the 1st Line Support team is organized in conjunction with the ROD team. Suggestions for several modes of operation of 1st line support are:

  • minimal - there is almost no 1st Line Support; all responsibility for solving problems is left with the sites, and any project responsibilities that are not covered will be taken over by ROD.
  • passive - they react only when a request for support comes from a site.
  • pro-active - they react to requests, look at results and notifications received from the monitors, and contact sites when there is a suspicion of a problem, suggesting solutions.

3.6 Mailing lists and follow-up archive

  • C-COD mailing list: project-eu-egee-sa1-c-cod-followup@cern.ch - internal communication for C-COD staff;
  • COD mailing list: project-eu-egee-sa1-ciconduty@cern.ch - internal communication for operations staff;
  • official ROD diffusion list: gridops-rod-teams@cern.ch;
  • optional discussion list: gridops-rod-teams-discuss@cern.ch.

3.7 Emergency contacts

Currently the GOCDB handles sites that are not committed to EGEE but are considered part of the CERN-ROC as far as LCG/EGEE operations are concerned; the CERN-ROC acts as a catch-all ROC for non-EGEE sites. Examples are generally sites outside of Europe, such as in India, Pakistan and the USA.

Phone contacts for all staff connected to sites and ROCs are available within the GOCDB and VO contacts are available from the Operations Portal.

8 Monitoring and problem tracking tools

8.1 Monitoring

There are currently a variety of monitoring tools in EGEE/LCG2 which are used to detect problems with sites. They also provide useful information about the sites. The operations dashboard (https://cic.gridops.org/index.php?section=roc&page=dashboard) provides links to, and combined views of, the most common monitoring tools used for performing operational tasks. The links below are for direct access to some of the monitoring tools used in the operations dashboard.

While the above tools are used on a daily basis, there are many more monitoring systems available. For a current overview, the GOC page on monitoring can be consulted:

8.1.1 Service Availability Monitoring (SAM)

One of the most detailed sources of information about possible problems is the Service Availability Monitoring portal (SAM Portal). On the front page it offers a selection of monitored services (CE, SE, RB, etc.), regions (multiple-selection list), VOs (single-selection list) and the sorting order of the results. For example, CE, the UKI region and the OPS VO might be selected.

After the choice is made, it displays the results of the node testing scripts, which are executed on a central SAM submission UI against each node of the selected service type at every site. In the case of the CE and gCE services, a subset of the test scripts is submitted to the site as a grid job and executed on one of the worker nodes; the result, however, is still bound to the CE that the test job was submitted to.

The test scripts contain a number of grid functional tests. As a result, a large table called the result matrix is produced. Each row of the table represents a single site and each column gives the information about the result of a single test performed against a node (or on a worker node) at a given site. Some of the tests are considered to be critical for grid operations and some of them are not.

Any failure observed in this table should be recognised, understood, diagnosed and the follow-up process should be started. This is especially important for failures of the critical tests, which should be treated with the highest priority.

The tests are executed against all sites automatically every hour. However, during the interaction with a given site or ROC, the need for test resubmission may arise. In that case, the operator should manually resubmit the test job to the selected sites (on-demand resubmission). The SAM Admin web application (SAMAP) should be used for this purpose.

The results table contains summary information about the test results (mostly OK/ERROR values). However, it is possible to see a fragment of a detailed report which corresponds to the test result by clicking on a particular result. Additionally, there are links to the GOCDB (Site Name column) and the site nodename. The detailed reports should be studied to diagnose the problems.

When investigating a particular site, clicking the node name takes you to a page showing the results of the tests against that site for the last seven days. The format is a bar chart created by GRIDVIEW; clicking on a bar for a particular test will display more detailed results. (NOTE: this view is the result of a recent update of the UI to show the SAM history for a given site. The older style view can still be accessed by selecting "(old history pages available here)" at the top of the Gridview History graphs area.)

In some cases, the same functionality fails on many sites at the same time. If this happens, it is useful to look at the sites' history. If sites that have been stable for a significant amount of time (weeks) start to fail, it might be a problem with one of the central services used to conduct the tests. Some failover procedures shall be put in place in the very near future. Since the regions have developed their own specific sets of tests, it may be very helpful to contact the ROCs during the process of problem identification.

By default, the SAM Portal will show only the tests that were chosen by the VO to be critical (in FCR, https://lcg-fcr.cern.ch:8443/fcr/fcr.cgi), and only an FCR-authorised person from the VO can influence the criticality of tests. However, it is possible to configure the list of displayed tests and show non-critical ones per service type by using the multiple-selection list "Tests Displayed". To revert any changes in the list of displayed tests, please check the "show critical tests" check box and click the "ShowSensorTests" button. All the remaining changes (service type, region, VO) can currently only be made from the front page.

8.1.1.1 Critical tests list definition

Grid services are tested by SAM using sensors - one sensor per service. Each sensor consists of a number of tests, some or all of which can be marked as critical. Making a test critical can be done for a number of reasons, namely:

  1. to give the sensor a status, thus allowing the SAM portal to apply colour-coding and display history
  2. to generate an alarm in the operations portal (https://cic.gridops.org/) in the case that the test returns an error status
  3. to include the status of the test in site availability calculations

The status of a sensor is defined as being that of the worst of its critical tests. If no tests are marked as critical, the convention is that the sensor does not have a status. Thus, in practice, at least one of a sensor's constituent tests should be defined as critical.

The SAM portal will only display the test history if the sensor that it is part of has a status (i.e. at least one test is defined as critical). This is also a pre-condition for colour-coding the display of test results. Hence the usefulness of making some tests critical.
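
The rule above ("the status of a sensor is that of the worst of its critical tests") can be expressed compactly. The sketch below is an illustration only; the ordering of result values from best to worst is an assumption made for the purpose of the example.

    # Illustrative sketch (Python) of the "sensor status = worst critical test"
    # rule.  The severity ordering OK < WARN < ERROR < CRIT is an assumption.
    SEVERITY = {"OK": 0, "WARN": 1, "ERROR": 2, "CRIT": 3}

    def sensor_status(test_results, critical_tests):
        """test_results: dict mapping test name -> result string;
        critical_tests: set of test names marked as critical."""
        results = [r for t, r in test_results.items() if t in critical_tests]
        if not results:
            return None  # by convention, the sensor has no status
        return max(results, key=SEVERITY.get)

    # One failing critical test is enough to give the whole sensor an ERROR status.
    print(sensor_status({"job-submit": "ERROR", "csh-check": "OK"}, {"job-submit"}))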

Critical tests for the OPS VO typically generate alarms in the dashboard, but this must be explicitly enabled by the SAM team. Tests that will generate C-COD alarms require coordination beforehand, and are the ones of greatest interest with respect to Operational Procedures.

To see which set of tests are critical for each sensor, one can use Freedom of Choice for Resources (FCR).

https://lcg-fcr.cern.ch:8443/fcr/fcr.cgi

On the FCR main page it is enough to select one or several services, the OPS VO (or the VO of interest), and to tick the "show Critical Test set" box. It is important to leave the site as the third criterion for the grouping order (this happens by default). The resulting page will contain a list of all the tests that are critical for that VO, relative to the chosen services.

Not all of the SAM sensors are used for site availability calculations; currently only the CE, SRMv2 and sBDII sensors are used for this purpose. For these sensors, each critical test has a direct impact on the availability calculations; changes to the set of critical tests must be agreed by the various management boards, and the procedures outlined below to define a test as critical for the OPS VO must be scrupulously followed.

More information can be obtained from the SAM wiki pages dedicated to the SAM critical tests for Availability Metrics calculations:

https://twiki.cern.ch/twiki/bin/view/LCG/EGEEWLCGCriticalProbes

Note that SAM has both a Validation and Production infrastructure. These procedures only apply to the Production infrastructure, because the Validation instance is purely for SAM testing and does not generate COD alarms.

Please note that within the scope of this document we only discuss OPS critical tests. The OPS VO tests are the basic infrastructure availability tests. Other VOs (e.g. the large HEP VOs) define, implement and run additional sensors, and define their own availability calculations.

"Candidate critical tests requests" i.e. new tests " which raise alarms to the dashboard upon failure(s) which operators on shift should take care of" are reviewed during the quarterly COD meetings. Urgent requests can come from the ROCs directly.

Whether in face-to-face meetings or by mail, the Grid OPS Procedures (aka 'Pole 2') members (project-eu-egee-sa1-cod-pole2-proc@cern.ch) take care of:

  • gathering input from people attending the COD meeting and making recommendations to the SAM team,
  • checking that all the C-COD shifters' activities and tools are ready to handle the new critical test.

That comes down to handling the following check-list:

  1. Gather COD feedback on the test implementation
  2. Gather COD feedback on the impact of the general activity
  3. Record feedback and final recommendations in a GGUS ticket referenced in the wiki
  4. Check availability of SAM automatic masking mechanism update
  5. Check availability of model email in the COD dashboard for the coming critical test
  6. Check availability of appropriate links in the COD dashboard for the coming critical test
  7. Check that appropriate instructions are defined in the Operations Procedure Manual and included in its released version.

After completion of the above steps by Pole 2 in conjunction with the SAM team, the SAM team takes the final request to make a test critical to the ROC managers. Once the ROC managers validate the new test to become critical for the OPS VO, it is announced to all the sites/ROCs/CODs via a BROADCAST during week W-2 and announced at the Weekly Operations Meeting one week later (W-1); according to all the feedback gathered in the process, the test will become critical in week W, at the time of the Weekly Operations Meeting.

The above procedure is summarized in the following list:

  1. Request coming from VO/ROC/Operations
  2. Evaluation and integration by the SAM team; ROC managers are informed of the new test so that any objections can be raised.
  3. Test put into action as a non-critical test with final "Error", "Warn" or "Ok" flags.
  4. Report is submitted by VO/ROC/Operations to ROC managers including details of sites failing by region.
  5. ROC reviews results with its sites.
  6. Standing item on ROC managers meeting to accept, reject or request more time. A target of two weeks should be put on this step.
  7. ROC managers accept, Weekly Operations Meeting notified. Wait one week.
  8. Broadcast to all sites with a one week count down, mention at Weekly Operations Meeting.
  9. Broadcast to all sites with one day count down.
  10. Test goes critical.

8.1.1.2 Automatic generation of alarms in SAM DB

Every time a new test result is entered into the SAM DB, the system checks whether an alarm has to be triggered or updated. This process is divided into two steps:

1. Creation of an Alarm

The procedure inserts a new alarm into the Alarm table ONLY if:

  • the test result is ERROR, CRIT or WARN;
  • the node belongs to a site which is Certified;
  • the VO is 'OPS';
  • the service name is not one of 'SE', 'SRM' or 'LFC';
  • the test is critical for the OPS VO;
  • there is no alarm already for that test, VO and node;
  • the node is not in maintenance.

If all these conditions hold, the alarm is created.
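
The creation step can be read as a single boolean condition. The sketch below restates the checks listed above in code form; the function and field names are illustrative only and do not reflect the actual SAM DB schema.

    # Sketch (Python) of the alarm-creation check described above.  Field and
    # function names are illustrative, not the real SAM DB schema.
    EXCLUDED_SERVICES = {"SE", "SRM", "LFC"}

    def should_create_alarm(result, site, node, test, existing_alarms):
        """result, site, node, test: dicts describing the new test result;
        existing_alarms: set of (test name, vo, node name) tuples that
        already have an open alarm."""
        return (
            result["status"] in {"ERROR", "CRIT", "WARN"}
            and site["certified"]
            and result["vo"] == "OPS"
            and result["service"] not in EXCLUDED_SERVICES
            and test["critical_for_ops"]
            and (test["name"], result["vo"], node["name"]) not in existing_alarms
            and not node["in_maintenance"]
        )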

2. Alarm Masking

The algorithm checks whether any existing alarm has to be masked because of the new alarm. If there are one or more alarms with status='new' for the OPS VO, on the same node, and with a testid that gets masked by the new alarm's test according to the alarmMaskTestDef table, then they are masked.

Note: alarmMaskTestDef defines which failing tests should mask which other tests.
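
A minimal sketch of the masking step is given below. Here alarm_mask_test_def stands in for the alarmMaskTestDef table as a mapping from a test id to the set of test ids it masks; the data structures are illustrative only.

    # Sketch (Python) of the alarm-masking step.  alarm_mask_test_def maps a
    # test id to the set of test ids it masks; data structures are illustrative.
    def mask_alarms(new_alarm, open_alarms, alarm_mask_test_def):
        """Mask every alarm with status 'new' for the OPS VO, on the same node
        as new_alarm, whose test is masked by new_alarm's test."""
        masked_tests = alarm_mask_test_def.get(new_alarm["testid"], set())
        for alarm in open_alarms:
            if (alarm["status"] == "new"
                    and alarm["vo"] == "OPS"
                    and alarm["node"] == new_alarm["node"]
                    and alarm["testid"] in masked_tests):
                alarm["status"] = "masked"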

8.1.2 SAMAP

The SAMAP tool, available from the dashboard (as a separate tab), lets the operator send on-demand SAM tests. Not all test sensors are available, and you will only be able to send a SAM test for a site in your region.

  • Select your region, if more than one region is shown, in the "Show Regions" list.
  • Select the sensor from the "Node type" list.
  • Choose other options where applicable. You may need to specifically select the WMS.
  • Click on the "Apply Filter" button.
  • Select the site (or sites) from the list.
  • Click on the "Submit SAM jobs" button.

The results of the SAM test will be available after some time - this can be anything up to a few hours.

  • Click on the "Refresh SAM jobs status" button to watch the progress of the submitted job(s).
  • When the Current status equals DONE, then you can publish the results by clicking on the "Publish available SAM jobs results" button.
SAMAP is useful when the site admin has worked on the problem and you would like a quick check of the status.

8.1.3 GIIS Monitor

The GIIS Monitor page provides the results of information system tests which are run every 5 minutes. The test does not rely on any submitted job, but rather scans all site GIISes/BDIIs to gather the information and performs so-called sanity checks to point out any potential problems with the information system of individual sites. The test covers the following areas:

  • Static information published by a site: site name, version number, contact details, runtime environment (installed software);
  • Logical information about a site's components: CE, number of processors, number of jobs, all SEs, summary storage space, etc.;
  • Information integrity: bindings between the CE and close SEs.

The GIIS Monitor provides an overall view of the grid; there is one large table with summary information about the sites, with colour codes to mark potential problems. Additionally, for each site it provides a detailed page with textual information about the results of the tests and sanity checks that have been performed, as well as several plots of the historical values of critical parameters such as jobs, CPUs and storage space. These historical plots are useful for spotting intermittent problems with a site's information system.
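
As an illustration of the kind of information the GIIS Monitor gathers, the sketch below queries a site BDII directly and prints the published site name and CEs. It assumes the python-ldap module, the usual site-BDII port 2170 and the Glue 1.x schema; the host name and site name are hypothetical placeholders.

    # Minimal sketch (Python): query a site BDII for the information the GIIS
    # Monitor checks (site name, CEs).  Assumes python-ldap, port 2170 and the
    # Glue 1.x schema; host and site names are hypothetical placeholders.
    import ldap

    BDII_HOST = "sitebdii.example.org"   # hypothetical site BDII host
    SITE_NAME = "EXAMPLE-SITE"           # hypothetical GOCDB site name

    conn = ldap.initialize("ldap://%s:2170" % BDII_HOST)
    base = "mds-vo-name=%s,o=grid" % SITE_NAME

    # Static site information (name, contacts, ...) is published as a GlueSite object.
    for _dn, attrs in conn.search_s(base, ldap.SCOPE_SUBTREE, "(objectClass=GlueSite)"):
        print("Site:", attrs.get("GlueSiteName"))

    # Each CE should publish a GlueCE object; a missing or stale entry is the kind
    # of problem the GIIS Monitor sanity checks are meant to catch.
    for _dn, attrs in conn.search_s(base, ldap.SCOPE_SUBTREE, "(objectClass=GlueCE)"):
        print("CE:", attrs.get("GlueCEUniqueID"))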

-- MichaelaLechner - 07 Jul 2009
