Author: Operations Coordination Centre
First Published: DRAFT
Last Update: 2009-02-16 by AntonioRetico
Last Content Review: 11-Feb-2009
Expiration: 31-Oct-2009 (after this date please contact the author and ask for a review)
Guidelines and procedures to manage critical incidents in production caused by middleware updates
This document is addressed to the EGEE release managers and operations managers involved in the process of releasing the gLite middleware. For the sake of brevity they are referred to from now on as the Release Manager and the Operations Manager.
The document contains operational guidelines and procedures to be used to evaluate and handle issues in production caused by a malfunction of the middleware. In particular it provides hints and methods for managing critical incidents, where "critical" denotes incidents that may require the adoption of emergency procedures such as a roll-back or a fast-tracked release.
The handling of security incidents is outside the scope of this document. That case is covered by the Grid Security Incident Handling and Response Guide issued by the Joint Security Policies Group (JSPG).
It is worth pointing out, though, that the follow-up of a security incident may sometimes require fast-tracked releases and/or a roll-back of the middleware in production, in which case the relevant procedures referenced in this document may be applied. In that case, however, the Release Manager and Operations Manager act according to the directives of the JSPG and do not take an active part in the decision process unless requested to.
Introduction
In a complex infrastructure like a world-scale grid, incidents in production due to middleware malfunction or misconfiguration happen every day. For example, during 2008 an average of two issues per day that could be associated with bugs affecting the integrated middleware distribution were reported within the WLCG production service.
Fixes for bugs found in production are normally provided and deployed within the standard gLite middleware release process. Sometimes, though, the severity of the issue forces the release and operations managers to take exceptional measures in order to restore the service, such as a roll-back to a previous version of the middleware or a quick deployment of middleware patches. The application of such exceptional measures is clearly not desirable, because they introduce stress for both service providers and users, and they translate into wasted work cycles (in the case of a roll-back) or into the hazardous roll-out into production of hastily tested software (in the case of a fast-track).
Management of critical incidents
Critical incidents affecting the gLite middleware distribution are handled according to the diagram below.
EPC diagram: Management of critical issues by Release and Operations Managers:
Basically, as soon as a critical incident is reported, a task force is quickly organised and gathers in an urgent meeting in order to evaluate the severity and to decide on appropriate actions. Following the decision of this meeting, one of four alternative measures may be taken: a roll-back, a fast-tracked release, a special information broadcast, or no action ("do nothing").
In the first three cases a post-mortem document of the incident has to be produced.
The next paragraphs explain this process in detail.
How the incident is reported and the process is triggered
The official way to report ANY issue affecting the production system is GGUS. The GGUS helpdesk operators (Ticket Process Managers and Regional Operations Centres) who process all the service tickets entering the support system are in a good position to recognise the severity and the general impact of the issues reported. If an issue happening in production is rated to be of critical severity and can be related to a problem in the middleware distribution, the helpdesk operators should make sure that the Operations Coordination Centre (OCC) and/or the gLite Release Team are duly informed. It has to be expected, though, that in some cases incident reports may reach the OCC or the gLite Release Team via non-standard channels (e.g. private e-mails, phone calls, etc.). However the information is conveyed, it is the responsibility of the Release Manager and Operations Manager to make sure that a GGUS ticket associated with the incident is properly opened and updated to track significant events concerning the incident.
Incident handling task-force and meeting
Goals of the meeting
To evaluate the severity of the reported incident
To make a decision about the follow-up
To define the action list
NOTE: It is worth pointing out here that if an emergency meeting is taking place, someone somewhere is most likely complaining loudly because his/her work is affected. This is human and understandable. Among the raisons d'être of this meeting, and of the whole incident management process, is to encourage cool-headedness and to avoid critical decisions being made by one or a few persons under strong emotional pressure, decisions which may finally turn out to have undesired side effects on the whole production infrastructure and its users.
Participants
The meeting is called and chaired either by the Release Manager or by the Operations Manager. They both participate in the meeting and may invite appropriate technical experts for consultation. The "experts" may be e.g.:
Senior service managers
Selected representatives of possibly impacted user communities
Appropriate documentation in support of the discussion and the subsequent decision should be brought to the meeting by the Release Manager and Operations Manager. In particular, this should include:
The (mandatory) GGUS ticket, plus any relevant messages/reports/logs produced by users and operators.
As much input as possible to help complete the check-list described in the next paragraph.
A post-mortem report of the incident can be started and used from the beginning to track e.g. the outcome of the meeting.
Agenda: 1. Evaluation of severity
The severity of an incident in production is hard to evaluate because its perception differs according to perspective. For example, an incident preventing a single user from working is clearly perceived as critical by that user, but it may turn out to be a secondary issue from the perspective of the whole service infrastructure.
The incident handling task force is called upon to evaluate the severity in an objective way and to take appropriate corrective measures if needed.
There is no deterministic recipe for evaluating the severity of an incident; experience with the production system of course plays an important role. Here a check-list of questions about the incident is proposed, which it is a useful exercise to try to answer before making a decision (a minimal sketch encoding the check-list as code is given after the list). In most cases the right evaluation and decision follow naturally from the replies.
Is there a GGUS ticket open tracking the issue? → If not, one should be opened in short order, and subsequent decisions should be transcribed there.
Is it a live issue or a potential one? → Sometimes a bug is discovered in production before it has become a real issue for the users of the middleware. In that case the likelihood of the issue "exploding" should be evaluated and weighed, also taking into account the time elapsed since the introduction of the bug into the production system.
Which VOs? → In order to answer this question and the related ones, it may be useful to consult the Resources Distribution tool on the CIC Portal (certificate needed), where the mapping between sites, services and VOs is reconstructed from the information held in the topology database and the information system.
How many users in that VO?
Issues affecting sites (and indirectly VOs):
How many sites?
Which sites?
Which LCG tiers? → The mapping between sites and LCG Tiers can be found in Gridview
Does the issue affect the site availability/reliability? → Gridview
Which services?
How critical are the affected services? → This depends mostly on usage. Four web pages list the services recognised as critical by the four LHC VOs, together with some indication of the maximum unavailability interval allowed by each LHC experiment (ALICE, ATLAS, CMS, LHCb).
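The check-list lends itself to being kept in a structured, reusable form from one incident to the next. Below is a minimal sketch in Python, purely illustrative and not an official EGEE tool: the questions are taken from the list above, while the grouping, the review helper and the example answers are assumptions made for the sake of the example.

<verbatim>
# Illustrative sketch only: the questions come from the check-list above;
# the grouping and the `review` helper are assumptions of this example,
# not an official EGEE tool or policy.

CHECKLIST = {
    "Tracking": [
        "Is there a GGUS ticket open tracking the issue?",
        "Is it a live issue or a potential one?",
    ],
    "Issues affecting VOs": [
        "Which VOs?",
        "How many users in those VOs?",
    ],
    "Issues affecting sites (and indirectly VOs)": [
        "How many sites?",
        "Which sites?",
        "Which LCG tiers?",
        "Does the issue affect the site availability/reliability?",
        "Which services, and how critical are they?",
    ],
}

def review(answers):
    """Print every question with the answer recorded so far and return
    the list of questions that are still open before a decision."""
    open_questions = []
    for group, questions in CHECKLIST.items():
        print("== %s ==" % group)
        for question in questions:
            answer = answers.get(question, "OPEN")
            print("  %s -> %s" % (question, answer))
            if question not in answers:
                open_questions.append(question)
    return open_questions

if __name__ == "__main__":
    # Hypothetical answers for an imaginary incident.
    known = {
        "Is there a GGUS ticket open tracking the issue?": "yes, GGUS #NNNNN",
        "Is it a live issue or a potential one?": "live",
        "How many sites?": "3",
    }
    still_open = review(known)
    print("%d question(s) still open before a decision." % len(still_open))
</verbatim>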
Agenda: 2. Decision about follow-up
After the severity of the incident has been evaluated, the task force has to decide whether to take a corrective action (roll-back, information broadcast, fast-tracked release) or just to "do nothing".
Here are some general considerations that could help in the decision.
The general principle to be observed while deciding what to do is to introduce the smallest possible perturbation into day-to-day operations. Therefore, in increasing order of perturbation, the preferred ways of fixing the issue are (a minimal sketch of this ordering as code is given after the list):
Do nothing
Send an information broadcast
Do a roll-back
Do a fast-tracked release
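The preference order above can be read as a simple rule: choose the least perturbing response that actually resolves the issue. A minimal sketch follows, again purely illustrative; the ordering comes from this document, while the is_applicable predicate stands in for the outcome of the task-force discussion and is an assumption of the example.

<verbatim>
# Illustrative sketch of the "least perturbation first" rule.
# The ordering is taken from this document; `is_applicable` is a
# caller-supplied predicate encoding the task-force's judgement for
# one concrete incident.

RESPONSES = [
    "do nothing",
    "send an information broadcast",
    "roll-back",
    "fast-tracked release",
]

def choose_response(is_applicable):
    """Return the first response, in order of increasing perturbation,
    that the task force judges applicable to the incident."""
    for response in RESPONSES:
        if is_applicable(response):
            return response
    raise ValueError("no applicable response: re-examine the incident")

# Hypothetical example: a documentation fix rules out "do nothing",
# so the broadcast is chosen.
print(choose_response(lambda r: r != "do nothing"))
</verbatim>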
The decision to "do nothing" can be justified in some cases. For example, if a problem is recognised to be a middleware distribution problem but the issue is confined to one site with very special settings and does not impact the others, a good solution would be to act locally, by suggesting a workaround if available or by rolling back the installation if possible. Once the operational issue is fixed locally, the cause would be reported as a bug and treated with due priority within the standard release process.
There are many cases where an information broadcast sent to the site/user communities is enough to deal with the incident.
In general these cases all stem from incomplete or wrong documentation associated with the release (e.g. release notes, installation and configuration guides, user guide). In that case, after the wrong parts of the documentation have been corrected, a message should be broadcast explaining the changes and pointing out e.g. the need for extra configuration actions to be performed at the sites.
This approach can also be followed in the particular case of the introduction into production of a new service/feature. Immediately after its deployment a new service/feature can be assumed to have no active users. If the introduction of the broken new service/feature into production is recognised to be harmless and to have no side effects on existing services, then a good choice is simply to remove it from the release documentation, or to mark it as broken, and to broadcast a message notifying the user communities and giving information about what effectively amounts to a delay in the deployment of the new service/feature.
If none of the previous solutions is applicable and an intervention at the level of the software distribution is needed, then the default choice should be a roll-back to the previous version. The roll-back ceases to be applicable if the problem was not introduced by the latest middleware version released but had been latent in the production system and only "exploded" at a certain point.
A fast-tracked release should be decided on only as a last resort (if none of the previous solutions is applicable) and only in case of very serious disruption to the service and its users. E.g. a security fix for a disclosed vulnerability affecting a version of the middleware earlier than the latest one will normally have to be dealt with via a high-priority release, which works exactly like a fast-tracked release. This is, however, within the scope of the JSPG processes, as mentioned before.
A different example (which has already happened) where a roll-back would not solve the problem is an update of external packages neither controlled nor distributed by gLite but on which the gLite middleware depends. If applying such an update in production causes issues that fall into the category of critical incidents, then a fast-tracked release of the necessary fixes may be unavoidable.
Whatever decision is made about how to follow up the critical incident, it is the joint responsibility of the Operations Manager and Release Manager to make sure that the decision is appropriately documented and motivated in the tracking GGUS ticket.
Agenda: 3. Definition of action list
An action list detailing the corrective actions to be undertaken has to be defined and transcribed in the tracking GGUS ticket, under the joint responsibility of the Operations Manager and Release Manager. Typically the first-level action list will contain items for either the Release Manager or the Operations Manager, e.g.:
roll-back the release → Release Manager
issue a broadcast to production → Ops Manager
start a fast-tracked release → Release Manager
close the incident → Ops Manager and/or Release Manager
The following section describes the methods for carrying out these different responses to the incident.
Incident response methods
Roll-back
The Operations Manager sends a broadcast via the CIC Portal broadcast tool to inform the production sites that a problem has been found with the release and a roll-back is needed. The aim of this broadcast is to prevent sites which have not yet performed the update from doing so. A template is available in the Templates section below.
The Release Manager makes sure that everything is correctly set up for the roll-back (repositories, documentation, ...). Then he/she sends a note to the sites via the CIC Portal broadcast tool giving the roll-back details. A template for this note is available here.
The Operations Manager links the GGUS ticket containing the incident report to the GT-PROD ticket used to track the update.
The Operations Manager and the Release Manager complete the post-mortem report of the incident.
technical procedure missing
Fast-tracked Update
The fast-tracked release is initiated by the Release Manager according to this technical procedure
The Operations Manager and the Release Manager complete the post-mortem report of the incident.
Issue Special Broadcast
The Release Manager operates the necessary changes to the documentation pages.
The Operations Manager sends a dedicated broadcast to the production system using the CIC Portal broadcast tool. The template provided contains basically just the addresses and a tentative outline of the content, the content being very much dependent on the observed issue.
The Operations Manager and the Release Manager complete the post-mortem report of the incident.
Do nothing
"Do nothing" means that the responsibility of following the incident up is transferred back from the "incident management task force" to the relevant support unit as defined in the standard GGUS workflow.
The Operations Manager and the Release Manager are requested to re-assign the GGUS ticket appropriately.
Templates
Template for pre-announcing a roll-back to the production service (using the CIC Portal broadcast tool)
TO:
CIC-on-Duty, OSG, ROC managers, VO managers, Production site admins,
support-eis@cern.ch,glite-announce@cern.ch
SUBJECT:
Announcement of roll-back of gLite 3.x UPDATE nn
MESSAGE:
Dear members of the EGEE Grid Production Service,
Because of a malfunction affecting the production grid, recently reported and tracked with the GGUS ticket
<URL to GGUS ticket>, the OCC and the gLite Middleware Release Team have decided to roll back
the middleware version distributed with gLite 3.x UPDATE nn.
-----------------------------------------------------------------------------
-- give here a brief summary of the issue taken from the GGUS ticket --
-----------------------------------------------------------------------------
-- REMOVE IF DOCUMENTATION READY --
The repositories are being adjusted for the roll-back and the relevant release documentation is being updated.
A notification will be sent very soon to all the sites where the update was applied, with instructions to start the roll-back procedure.
Those sites that have not upgraded yet are not concerned by this message and the following one.
We deeply apologise for any inconvenience.
Best regards,
The OCC and the gLite Middleware Teams
Template for a special announcement related to an incident that occurred in the production service (using the CIC Portal broadcast tool)
TO:
Publish in the CIC Portal news, CIC-on-Duty, OSG, ROC managers, VO managers,
Production site admins, Preproduction site admins ,glite-announce@cern.ch
SUBJECT:
IMPORTANT INFO: gLite 3.x UPDATE nn: [short description]
MESSAGE:
Dear members of the EGEE Grid Production Service,
This message concerns the sites that have upgraded to gLite 3.x UPDATE nn
-- REMOVE IF NOT RELEVANT --
and more specifically the sites supporting the VO(s) ... the sites running ... as batch system/storage element/back-end
----------------
A malfunction affecting the production grid was recently reported and is tracked with the GGUS ticket <URL to GGUS ticket>.
-----------------------------------------------------------------------------
-- give here a brief summary of the issue taken from the GGUS ticket --
-----------------------------------------------------------------------------
As a result of the analysis done by the experts it turned out that:
-- EXAMPLE --
the configuration can be fixed locally by manually changing the configuration value XXX
----------------
[...] other
-- REMOVE IF NOT RELEVANT --
The proposed changes should be applied [immediately|urgently|at the sites' earliest convenience|with low priority|by DD-MMM-YYYY]
----------------
----------------
-- REMOVE IF NOT RELEVANT --
Sites that have not upgraded yet are recommended to apply the following configuration change
----------------
All the configuration instructions given with this message have been documented in the Release Notes,
in the section concerning the gLite UPDATE mentioned in the subject.
All details of the update can be found in:
http://glite.web.cern.ch/glite/packages/R3.x/updates.asp
(services supported on 32-bit architectures)
and
http://glite.web.cern.ch/glite/packages/R3.x/x86_64/updates.asp
(services supported on 64-bit architectures)
Those sites that have not upgraded yet are not concerned by this message.
We deeply apologise for any inconvenience.
Best regards,
The OCC and the gLite Middleware Teams
Template for a post-mortem report.
---++ !One-line description of the incident
---+++ Description
short description here
---+++ Time line of the incident
Detailed, including references to external documents if available
---+++ Analysis
That should be mostly the outcome of the task force meeting
---+++ Followup
Actions taken and measures to prevent the incident from happening again
---+++ Acknowledgements
If any
-----------------------------------