4.1 GOC Database
The GOC Database (GOCDB) is a core service within EGEE and is used by many tools for monitoring and accounting purposes. It also contains essential static information about the sites such as:
- site name;
- location (region/country);
- list of responsible people and contact details (site administrators, security managers);
- list of all services running on the nodes (CE, SE, UI, RB, BDII etc.);
- phone numbers.
Site administrators have to enter all scheduled downtimes into the GOCDB. The information provided by the GOCDB is an important source during problem follow-up and the escalation procedure.
4.1.3 Removing problematic sites
There is a site status in the GOCDB called suspended. When this status is selected, the site is automatically removed from the top-level BDII and its monitoring is turned off; the site is then out of the grid. This status must be used cautiously by ROCs, as it is reserved for sites for which tickets have been raised and not solved on time, i.e. it is the last escalation step in the daily operations procedure applied by the COD.
Under exceptional circumstances, COD may also suspend a site without going through all the steps of the escalation procedure: for example, in the case of a security hazard, COD must suspend the site on the spot.
Consequently, for the normal course of operations, when a ROC needs to take one of its sites out of production, it should go through the site downtime scheduling procedure described below. By contrast, a ROC may want to use the suspended status when one of its sites has to be permanently or indefinitely taken out of production. Note that when a site modifies any field of its site form within the GOCDB, its ROC should receive a notification mail.
4.1.6 Emergency contacts
Currently the GOCDB handles sites that are not committed to EGEE but are considered part of the CERN-ROC as far as LCG/EGEE operations are concerned; the CERN-ROC acts as a catch-all ROC for non-EGEE sites. Examples are generally sites outside Europe, such as those in India, Pakistan and the USA.
Phone contacts for all staff connected to sites and ROCs are available within the GOCDB and VO contacts are available from the Operations Portal.
6 Monitoring Sites
6.1 Critical tests list definition
Grid services are tested by SAM using sensors, one sensor per service. Each sensor consists of a number of tests, some or all of which can be marked as critical. A test can be made critical for a number of reasons:
- to give the sensor a status, thus allowing the SAM portal to apply colour-coding and display history;
- to generate an alarm in the CIC portal if the test returns an error status;
- to include the status of the test in site availability calculations.
The status of a sensor is defined as being that of the worst of its critical tests. If no tests are marked as critical, the convention is that the sensor does not have a status. Thus, in practice, at least one of a sensor's constituent tests should be defined as critical.
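As an illustration of this convention (the status names, ranking and function below are hypothetical, not the SAM implementation), a sensor status could be derived from its test results as follows:

```python
# Hypothetical illustration of the "worst critical test" convention;
# the status names and ranking are assumptions, not the SAM implementation.

# Rank statuses from best to worst.
SEVERITY = {"ok": 0, "warn": 1, "error": 2}

def sensor_status(test_results, critical_tests):
    """Return the worst status among critical tests, or None if no test is critical.

    test_results   -- dict mapping test name to a status string ("ok"/"warn"/"error")
    critical_tests -- set of test names marked as critical for this sensor
    """
    critical_results = [status for name, status in test_results.items()
                        if name in critical_tests]
    if not critical_results:
        return None  # sensor has no status: no colour-coding, no history
    return max(critical_results, key=lambda s: SEVERITY[s])

# Example: one critical test in error drives the whole sensor to "error".
print(sensor_status({"job-submit": "error", "csh-test": "ok"}, {"job-submit"}))
```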
The SAM portal will only display the test history if the sensor that it is part of has a status (i.e. at least one test is defined as critical). This is also a pre-condition for colour-coding the display of test results; hence the usefulness of making some tests critical.
Critical tests typically generate alarms in the CIC dashboard, but this must be explicitly enabled by the SAM team. Tests that will generate COD alarms require coordination beforehand, and are the ones of greatest interest with respect to Operational Procedures.
Not all of the SAM sensors are used for site availability calculations. Currently only the CE, SE, SRM, SRMv2 and sBDII sensors are used for this purpose. For those five sensors, each critical test has a direct impact on availability calculations; changes to the set of critical tests must be agreed to by various management boards, and the procedures outlined below must be scrupulously followed.
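For illustration only, the sketch below shows one plausible way such an availability aggregation could work, under the assumption (not taken from this document) that a site counts as available when, for each of these service types it runs, at least one instance passes all its critical tests; the data layout and function names are hypothetical.

```python
# Hypothetical sketch of an AND-over-service-types, OR-over-instances
# site status aggregation; the field names and the exact rule are assumptions,
# not the official availability algorithm.

AVAILABILITY_SENSORS = {"CE", "SE", "SRM", "SRMv2", "sBDII"}

def site_up(instance_statuses):
    """instance_statuses: dict mapping (service_type, hostname) -> True/False
    (True means all critical tests of that sensor passed)."""
    relevant = {(svc, host): ok for (svc, host), ok in instance_statuses.items()
                if svc in AVAILABILITY_SENSORS}
    service_types = {svc for svc, _ in relevant}
    # Site is up only if every relevant service type has at least one good instance.
    return all(
        any(ok for (svc, _), ok in relevant.items() if svc == svc_type)
        for svc_type in service_types
    )

# Example: two CEs, one failing, plus a healthy sBDII -> site still counts as up.
print(site_up({("CE", "ce01"): False, ("CE", "ce02"): True, ("sBDII", "bdii01"): True}))
```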
Note that SAM has both a Validation and Production infrastructure. These procedures only apply to the Production infrastructure, because the Validation instance is purely for SAM testing and does not generate COD alarms.
Please note that within the scope of this document we are only discussing OPS critical tests. The OPS VO tests are the basic infrastructure availability tests. Other VOs (e.g. the large HEP VOs) are free to run additional sensors and define their own availability calculations.
"Candidate critical tests requests" i.e. new tests " which raise alarms to the dashboard upon failure(s) which operators on shift should take care of" are reviewed during the quarterly COD meetings. Urgent requests can come from the ROCs directly.
In both cases (face-to-face meeting or by mail), Pole 2 takes care of:
- gathering input from the COD attendance and making recommendations to the SAM team,
- checking that the shifters' activities are ready to handle the new critical test.
That comes down to handling the following check-list:
- Gather COD feedback on the test implementation
- Gather COD feedback on the impact of the general activity
- Record feedback and final recommendations in a GGUS ticket referenced in the wiki
- Check availability of SAM automatic masking mechanism update
- Check availability of model email in the COD dashboard for the coming critical test
- Check availability of appropriate links in the COD dashboard for the coming critical test
- Check that appropriate instructions are defined in the Operations Procedure Manual and included in its release.
After completion of the above steps by Pole 2 in conjunction with SAM, the SAM team takes the final request to make the test critical to the ROC managers. Once the ROC managers validate the new test to become critical for the OPS VO, it is announced to all sites/ROCs/CODs via a broadcast during week W-2 and then announced at the Weekly Operations Meeting one week later (W-1); according to the feedback gathered in the meantime, the test is turned critical in week W, at the time of the Weekly Operations Meeting.
The above procedure is summarized in the following list:
- Request coming from VO/ROC/Operations
- Evaluation and integration by the SAM team, which checks for any objection; ROC managers are informed of the new test.
- Test put into action as a non-critical test with final "Error", "Warn" or "Ok" flags.
- Report is submitted by VO/ROC/Operations to ROC managers including details of sites failing by region.
- ROC reviews results with its sites.
- Standing item at the ROC managers' meeting to accept, reject or request more time. A target of two weeks should be set for this step.
- ROC managers accept; the Weekly Operations Meeting is notified. Wait one week.
- Broadcast to all sites with a one-week countdown; mention at the Weekly Operations Meeting.
- Broadcast to all sites with a one-day countdown.
- Test goes critical.
6.2 Service Availability Monitoring (SAM)
One of the most detailed information pages about possible problems is the Service Availability Monitoring portal (SAM Portal).
On the front page it offers a selection of monitored services (CE, SE, RB, etc.), regions (multiple-selection list), VOs (single-selection list) and the sorting order of the results.
After the choice is made, it displays the results of the node testing scripts, which are executed from a central SAM submission UI against each node of the selected service type at every site. The exception is the CE and gCE services, for which a subset of the test scripts is submitted to the site as a grid job and executed on one of the worker nodes; the result, however, is still bound to the CE to which the test job was submitted.
The test scripts contain a number of grid functional tests. As a result, a large table called the result matrix is produced: each row represents a single site and each column gives the result of a single test performed against a node (or on a worker node) at that site. Some of the tests are considered critical for grid operations and some are not. Any failure observed in this table should be recognised, understood and diagnosed, and the follow-up process should be started; this is especially important for failures of critical tests, which should be treated with the highest priority.
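Conceptually, the result matrix can be treated as a mapping from (site, test) to a status from which critical failures can be extracted for follow-up; the sketch below is purely illustrative, and its test names and data layout are assumptions rather than the SAM schema.

```python
# Hypothetical sketch: extract critical-test failures from a result matrix.
# The nested-dict layout and test names are assumptions for illustration.

result_matrix = {
    "SITE-A": {"CE-sft-job": "ok",    "CE-sft-caver": "error", "SE-lcg-cp": "ok"},
    "SITE-B": {"CE-sft-job": "error", "CE-sft-caver": "ok",    "SE-lcg-cp": "warn"},
}
critical_tests = {"CE-sft-job", "SE-lcg-cp"}

def critical_failures(matrix, critical):
    """Yield (site, test) pairs whose critical test is in error and needs follow-up."""
    for site, results in matrix.items():
        for test, status in results.items():
            if test in critical and status == "error":
                yield site, test

for site, test in critical_failures(result_matrix, critical_tests):
    print(f"{site}: critical test {test} failed -- start follow-up")
```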
(NOTE: A recent update of the UI has provided a different view of the SAM history for a given site. The view described above can still be accessed by selecting the "old history pages available here" link at the top of the Gridview History graphs area.)
The tests are executed against all sites automatically every 2 hours. However, during the interaction with a given site's ROC, the need for test resubmission may arise; in that case, the operator should manually resubmit the test job to the selected sites (on-demand resubmission).
The SAM Admin web application should be used for this purpose.
The results table contains summary information about the test results (mostly OK/ERROR values). However, it is possible to see a fragment of the detailed report corresponding to a test result by clicking on that result. Additionally, there are links to the GOCDB (Site Name column). The detailed reports should be studied to diagnose problems.
In some cases, the same functionality fails on many sites at the same time. If this happens, it is useful to look at the sites' history. If sites that have been stable for a significant amount of time (weeks) start to fail, the problem may lie with one of the central services used to conduct the tests. Some failover procedures shall be put in place in the very near future. Since the regions have developed their own specific sets of tests, it may be very helpful to contact the ROCs during the process of problem identification.
By default, the SAM Portal shows only the tests that were chosen by the VO to be critical (in FCR), and only an FCR-authorised person from the VO can influence the criticality of tests. However, it is possible to configure the list of displayed tests and show non-critical ones per service type by using the multiple-selection list "Tests Displayed". To revert any changes in the list of displayed tests, check the "show critical tests" check box and click the "ShowSensorTests" button. All the remaining changes (service type, region, VO) can currently only be made from the front page.
6.3 GIIS Monitor
The GIIS Monitor page provides the results of information system tests which are run every 5 minutes. The test does not rely on any submitted job; rather, it scans all site GIISes/BDIIs to gather the information and performs so-called sanity checks to point out potential problems with the information systems of individual sites. The test covers the following areas (a sketch of one such check is given after the list):
- Static information published by a site: site name, version number, contact details, runtime environment (installed software);
- Logical information about the site's components: CEs, number of processors, number of jobs, all SEs, summary storage space, etc.;
- Information integrity: bindings between CEs and their close SEs.
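As a rough illustration of the kind of integrity check involved (a sketch only: the port, LDAP base and GLUE attribute names are common gLite conventions assumed here, not taken from the GIIS Monitor implementation), one could query a site BDII and list CEs that publish no close-SE binding:

```python
# Hypothetical sanity-check sketch using the ldap3 library; the port (2170),
# base ("o=grid") and GLUE 1.x attribute names are assumptions, not taken
# from the GIIS Monitor code.
from ldap3 import Server, Connection, ALL

def ces_without_close_se(site_bdii_host):
    """Return CE unique IDs published by the site that have no CESEBind entry."""
    server = Server(f"ldap://{site_bdii_host}:2170", get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous bind

    conn.search("o=grid", "(objectClass=GlueCE)", attributes=["GlueCEUniqueID"])
    ces = {str(e.GlueCEUniqueID) for e in conn.entries}

    conn.search("o=grid", "(objectClass=GlueCESEBindGroup)",
                attributes=["GlueCESEBindGroupCEUniqueID"])
    bound = {str(e.GlueCESEBindGroupCEUniqueID) for e in conn.entries}

    return ces - bound  # CEs with no close-SE binding: an integrity problem

# Example (hypothetical host name):
# print(ces_without_close_se("site-bdii.example.org"))
```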
The GIIS Monitor provides an overall view of the grid: one large table with summary information about the sites, with colour codes marking potential problems.
Additionally, for each site it provides a detailed page with textual information about the results of the tests and sanity checks that have been performed, as well as several plots of historical values of critical parameters such as jobs, CPUs and storage space. These historical plots are useful for spotting intermittent problems with a site's information system.
7.5 Workflow and escalation procedure
This section introduces a critical part of operations: the detection, identification and solving of site problems. The escalation procedure is the procedure that operators have to follow whenever a problem related to a site is detected. The main goal of the procedure is to keep track of the whole follow-up process and keep it consistent from the detection of a problem until the ultimate solution is reached.
Moreover, the procedure is supposed to introduce a hierarchical structure and a distribution of responsibility in problem solving, which should lead to a significant improvement in the quality of the production grid service. Consequently, minimizing the delay between the steps of the procedure is of the utmost importance.
The regular procedure the operators follow can be considered in four phases:
- submitting problems into the problem tracking tool after they are detected using monitoring tools, or following a task created by a regional operations team (ROC);
- updating the task when a site's state changes, which can be detected either by comparing the monitoring information with the current state of the task in the problem tracking tool, or by input from a ROC;
- closing tickets or escalating outdated tickets when deadlines are reached in the problem tracking tool;
- last escalation step and/or communication with site administrators and ROCs.
Below are the detailed steps of the escalation procedure to follow if no response is received to the notification of a problem or the problem has been left unattended:
Step # | Deadline [days] | COD action
1      | 3               | 1st mail to site admin and ROC
2      | 3               | 2nd mail to ROC and site admin
3      | <5              | 3rd mail to ROC and site admin stating that this will go forward to the next weekly operations meeting for discussion and that representation at the meeting is requested. The discussion may also include site suspension.
4      | -               | Discuss at the next weekly operations meeting
5      | -               | Request OCC approval of site suspension
6      | -               | Ask ROC to suspend the site
NB: Theoretically, the whole process could be covered in a three-week period. Most often, a site is either suspended on the spot for security reasons, or the problem is solved (or the site and the ROC react) well before operators need to escalate the issue to the Weekly Operations Meeting.
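Purely to illustrate how the deadlines in the table above accumulate (the data structure and function are hypothetical, not part of any COD tool), the escalation step reached by an unattended ticket could be derived as follows:

```python
# Hypothetical sketch: map the age (in days) of an unattended ticket to the
# escalation step from the table above. Steps 4-6 have no fixed day deadline
# here; they follow the weekly operations meeting cycle.
ESCALATION_STEPS = [
    (3, "1st mail to site admin and ROC"),
    (3, "2nd mail to ROC and site admin"),
    (5, "3rd mail to ROC and site admin, announce discussion at weekly meeting"),
]

def current_step(days_unattended):
    """Return (step_number, action) for a ticket unattended for the given number of days."""
    elapsed = 0
    for number, (deadline, action) in enumerate(ESCALATION_STEPS, start=1):
        elapsed += deadline
        if days_unattended <= elapsed:
            return number, action
    return 4, "Bring the ticket to the next weekly operations meeting"

print(current_step(5))   # -> (2, '2nd mail to ROC and site admin')
```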
Detection, diagnosis and problem tracking tools were described in the previous sub-sections. The integration of these steps in the process, and the details of the escalation procedure workflow, are summarised below.
NB: On the schema below, the second-level support that COD may have to provide is accessed through the mailing list and the GGUS portal, i.e. when other support units re-assign tickets to the grid experts (the current COD). For that reason, second-level support is shown dashed.
NB: After the first 3 days, at the 2nd escalation step, COD suggests that the site declare a downtime until the problem is solved; the ROC has to be notified. If the site does not accept the downtime, COD proceeds with the regular escalation procedure at the agreed deadlines.
Most of the COD tasks can be dealt with from the operations dashboard available on the Operations portal. The use of the operations dashboard is presented in section 6.
8.4 Sites that fail SAM tests but still continue to be operational
Operators need to set the ticket to "unsolvable" or set a long-term deadline. A mail must be sent to the COD mailing list. These sites should not be candidates for suspension.
8.8 Ticket handling during Weekends
Since weekends are not considered working days, COD teams have no responsibilities during weekends; CODs should therefore take care that tickets do not expire during a weekend.