Authors: Antonio Retico(SA1), Oliver Keeble(SA3), Maite Barroso(SA1), Diana Bosio(SA1), Steve Traylen(SA1), Nick Thackray(SA1), Maarten Litmaath (WLCG)
First Published: 22-Jun-2009
Last Update: 2010-01-06 by Main.unknown
Last Content Review: 19-Jun-2009
Expiration: 1-Mar-2010 (after this date please contact the author(s))

Staged roll-out of grid middleware: general lines

Scope and goals of this document

This document outlines the process for rolling out middleware updates to the production service in a controlled way. Both the EGEE and WLCG management deemed such a process necessary in order to protect the production system against incidents caused by undetected flaws in the UMD middleware.

The process is intended for use in the context of the EGI infrastructure as described in [1].

The aim of this document is to provide a common reference for the SA1 and SA3 developers in charge of specifying the technical aspects of the process and of implementing it.

Staged roll-out: process overview

Ownership

This process was jointly developed within the EGEE-III project by the SA1 and SA3 activities, which retain ownership.

At the end of the EGEE-III project, ownership will be transferred to EGI.eu.

Players

  • The EGI.eu Middleware Unit (MU) as defined in [2].
  • The EGI.eu Operations Unit (OU) as defined in [2].
  • The Product Team as defined in [2].
  • The Early Adopter Site (EA): a production site that commits to EGI to install each new update quickly and to provide feedback.

Assumptions

The following assumptions are based on [2] and on the work in progress of the UMD working group.
  1. MW products are developed and tested by independent Product Teams.
  2. Release to EGI (production) means that the SW is released by the Product Teams into the beta repository.
  3. There is one separate repository per product (node type).
  4. Dependencies between components are managed separately.
    • E.g. the products WMS and LFC could use two different versions of VOMS. Everything needed is included in the product repository.

Workflow

The Product Team X releases a new update (version N+1) in the beta repository

For the product team this step is "the" release to production.

The staged roll-out is an operational process managed by the OU and the MU, and it is transparent to the product team.

The MU announces the release of Product X version N+1 to the Early Adopter Sites

Early Adopter sites are a club of production sites that have committed to EGI to provide this service.

They are known and registered with the OU for this task (OPT-IN approach).

Communication and announcements to EA sites happen through dedicated channels (e.g. mailing lists), where the availability of the update N+1 and the links to the SW repository and release notes are posted.

The middleware release pages for v.N+1 are prepared and made publicly accessible, but not yet as the default. The default release pages stop at Update N (the latest stable release). Links to the N+1 pages are provided for sites that are not officially registered as Early Adopters but want to update at their own risk.

Early Adopter Sites are expected to update the relevant services and to report on failures within a time period specified by an SLA

The SLA is defined by EGI and maintained by the EGI.eu OU.

The SLA defines:

  • acceptable timelines for installation at the EA sites
  • acceptable timelines for the EA sites to provide negative feedback
  • nature and quality of the feedback
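
The timeline part of such an SLA could be represented roughly as follows. This is only a minimal sketch: the class name, the field names and the choice to count both deadlines in days from the announcement date are assumptions made for illustration, not part of the agreed process. The nature and quality of the feedback are not modelled here.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class StagedRolloutSla:
    """Hypothetical SLA terms for one product (names are illustrative)."""
    install_within_days: int   # acceptable timeline for installation at EA sites
    feedback_within_days: int  # acceptable timeline for negative feedback


def install_deadline(sla: StagedRolloutSla, announced: date) -> date:
    """Last day on which an EA site can install and still meet the SLA."""
    return announced + timedelta(days=sla.install_within_days)


def feedback_deadline(sla: StagedRolloutSla, announced: date) -> date:
    """Last day on which negative feedback is expected from an EA site."""
    return announced + timedelta(days=sla.feedback_within_days)


def installed_in_time(sla: StagedRolloutSla, announced: date,
                      installed: Optional[date]) -> bool:
    """True if the site installed the update on or before the deadline."""
    return installed is not None and installed <= install_deadline(sla, announced)
```

The actual values would be whatever EGI defines per product; the sketch only shows that each deadline can be derived from the announcement date and checked mechanically.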

The MU makes available to the EA sites:

  • the software at version N+1 through a special ("Next") SW repository
  • the release notes for version N+1
  • links to the version N+1 as an option in the default release pages

Any issues experienced by the EA sites are to be reported via the standard channels (e.g. GGUS, Savannah).

The release is rolled-out to the wide production service after a quarantine period

The quarantine period may be different for different products. The duration of the quarantine period for each product is defined by EGI and maintained by the EGI.eu OU.

If serious issues are found by the EA sites during the quarantine period, the release may be rejected. The owner of this decision is the OU. In that case, the update N+1 is removed from the special ("Next") repository and the EA sites are requested to roll back. Support for the roll-back may be requested from the MU.

The quarantine is not a compulsory waiting time: sites that wish to do so can skip it and proceed at their own risk, using the links to the new release notes, which will be publicly available.
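
The decision logic of the quarantine step could be sketched as below. The function and enum names are hypothetical, and measuring the quarantine in days from the announcement to the EA sites is an assumption; the process itself only states that the period is defined per product and that the OU owns the rejection decision.

```python
from datetime import date, timedelta
from enum import Enum


class Outcome(Enum):
    IN_QUARANTINE = "in quarantine"
    REJECTED = "rejected"
    ROLLED_OUT = "rolled out"


def quarantine_outcome(announced: date, quarantine_days: int,
                       today: date, rejected_by_ou: bool) -> Outcome:
    """Status of update N+1 for one product on a given day."""
    if rejected_by_ou:
        # The OU rejected the release: N+1 is removed from "Next"
        # and the EA sites are requested to roll back.
        return Outcome.REJECTED
    if today >= announced + timedelta(days=quarantine_days):
        # Quarantine elapsed without rejection: wide roll-out.
        return Outcome.ROLLED_OUT
    return Outcome.IN_QUARANTINE
```

Sites that skip the quarantine at their own risk are outside this function: it describes only when the update becomes the default for the wide production service.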

Exceptions

The flow of MW updates coming from the UMD through the EA sites towards the production sites as described above is generally serial, meaning that Update N+2 of product X cannot be deployed while the staged roll-out of N+1 is pending.
Update N+2 can reach production first only if Update N+1 is obsoleted.

Exceptions to this rule will be supported (e.g. critical security patches to be deployed quickly on the full production system). These exceptions would normally be managed by increasing the release number of version N (e.g. 2.1.4-1 → 2.1.4-2).
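
The serial rule and the release-number exception can be illustrated with a short sketch. Both function names are hypothetical, and the version format "version-release" is inferred only from the example 2.1.4-1 → 2.1.4-2 given above.

```python
from typing import Optional


def may_stage_next(pending_update: Optional[str],
                   pending_obsoleted: bool = False) -> bool:
    """True if a further update of the same product may enter staged roll-out.

    While an update is still pending, the next one must wait,
    unless the pending update has been obsoleted.
    """
    if pending_update is None:
        return True
    return pending_obsoleted


def bump_release(version: str) -> str:
    """Increase the release number of a 'version-release' string."""
    base, sep, release = version.rpartition("-")
    if not sep or not release.isdigit():
        raise ValueError(f"expected 'version-release', got {version!r}")
    return f"{base}-{int(release) + 1}"
```

For instance, `bump_release("2.1.4-1")` gives `"2.1.4-2"`, which is how an urgent fix to version N could be published without waiting for the pending N+1.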

Failure Management

If the process described above fails and critical software problems reach production despite the staged roll-out, the MU is responsible for providing methods that the affected sites can apply to restore the previous operational conditions.

Implementation notes

Here we present some implementation details which were agreed upon while developing the process described above.

Repositories

The staged roll-out process is supported by two operational repositories per product, here conventionally called the "Current" repository (containing the last stable version) and the "Next" repository (containing the new packages that are the subject of the staged roll-out).

Both repositories, as well as the corresponding links to the release notes, are maintained by the MU on behalf of the OU.

The way the "Current" and "Next" repositories are meant to be used by the production sites (Early Adopters or not) is described in the figure below.

Figure (repositories.png): Use of "Current" and "Next" MW repositories by Early Adopters vs. standard production sites.

The "Current" production repository must always be in a consistent state, meaning that every site not interested in the early deployment exercise can at any time safely install from the "Current" repository the version advertised as the last known stable version.
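
The relationship between the two repositories can be modelled as a small state holder. This is an illustrative sketch only: the class and method names are assumptions, and each repository is reduced to a single version string instead of a set of packages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductRepositories:
    """Hypothetical model of the "Current" and "Next" repositories of one product."""
    current_version: str                # last known stable version
    next_version: Optional[str] = None  # update under staged roll-out, if any

    def stage(self, version: str) -> None:
        """Publish a new update to "Next"; "Current" is left untouched."""
        self.next_version = version

    def promote(self) -> None:
        """After the quarantine: the staged update becomes the stable version."""
        if self.next_version is None:
            raise RuntimeError("nothing is staged in the Next repository")
        self.current_version, self.next_version = self.next_version, None

    def reject(self) -> None:
        """Remove a rejected update from "Next"; "Current" is left untouched."""
        self.next_version = None
```

In this model only `promote` ever changes `current_version`, which reflects the requirement that sites installing from "Current" always get the advertised stable version, whatever happens in "Next".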

Related Documents

  1. EGI Blueprint - EGI_DS project
  2. EGI: Managing the Software Process - Steven Newhouse et al.
  3. SA1 proposal and requirements for staged-roll-out of middleware updates - Maite Barroso, Antonio Retico
  4. Minutes of EGEE SA1/SA3 brainstorming sessions on staged roll-out