Template for Grid Service intervention (DRAFT)

Aim:

To cover (step by step) actions before/during/after interventions on the main Grid services that run at the CERN site. The level of detail per service should be agreed at LCGSCM and be saved in an agreed place.

These could then be used by the SMOD and GMOD whenever a future intervention is scheduled, simplifying the announcement procedure while giving users clearer information.

Definitions

Transparent intervention: A short service interruption, during which the middleware and/or client software should keep retrying. We should be equipped to make the following types of intervention transparent:
Adding new resources to an existing service;
Replacing hardware used by an existing service;
Operating system / middleware upgrade / patch;
Similar operations on DB backend (where applicable).
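The "keep retrying" behaviour expected from clients during a transparent intervention can be sketched as below. This is only an illustration, not the retry logic of any particular middleware: the function name, the number of attempts, the delays and the choice of ConnectionError as the retryable failure are all assumptions.

```python
import time


def call_with_retry(operation, attempts=5, initial_delay=1.0, backoff=2.0,
                    sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    All parameter values are illustrative assumptions, not agreed settings.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            # Assumed failure mode: the service is briefly unreachable.
            if attempt == attempts:
                raise  # the interruption outlasted the retry window
            sleep(delay)
            delay *= backoff  # wait longer before each new attempt
```

With these defaults a client waits at most 1 + 2 + 4 + 8 = 15 seconds in total before giving up, so an intervention is transparent to it only if the service is back within roughly that window.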

Long intervention: An intervention for which an interruption perceivable by the users/sites could not be avoided, or for which a transparent procedure was not implemented.

Pathological cases: Unforeseeable 'accidents' for which announcement templates can help spread information quickly:

Massive machine room reconfigurations, as were performed at CERN (and elsewhere) to prepare for LHC;
Wide-spread power or cooling problems;
Major network problems, such as DNS failures.

Parameters

Not all services will require this level of detail, e.g. the "Hostnames involved" below may not be explicitly mentioned in the announcement. Nevertheless, this has to be agreed with the service.

  • Type of intervention: short/long (open question: should every service have a different flow per type?)
  • Event description:
  • Intervention date, time, duration:
  • Coordinator:
  • People involved:
  • Service name:
  • Hostnames involved:
  • Netops availability needed: yes/no
  • Operators' action needed: describe action/no
  • Alarms to de-activate during intervention: List
  • LCGSCM trace: Link minutes when Done
  • GMOD Name: Select the right week to find the GMOD
  • Broadcast frequency: one week before and/or one day before and/or at the intervention start/end (see Template for announcements)
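The parameters above could be captured in a small record so that every announcement starts from the same fields. The sketch below is only one possible shape: the class name, field names, types and defaults are assumptions, and the administrative fields (LCGSCM trace, GMOD name, broadcast frequency) are left out for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Intervention:
    """One scheduled intervention (hypothetical field names)."""
    kind: str                      # "short" or "long"
    description: str
    start: datetime
    duration: timedelta
    coordinator: str
    service_name: str
    people_involved: List[str] = field(default_factory=list)
    hostnames: List[str] = field(default_factory=list)
    netops_needed: bool = False
    operator_action: str = "no"    # describe the action, or "no"
    alarms_to_deactivate: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject anything other than the two types defined on this page.
        if self.kind not in ("short", "long"):
            raise ValueError("kind must be 'short' or 'long'")

    @property
    def end(self) -> datetime:
        """Planned end time, derived from start and duration."""
        return self.start + self.duration
```

Fields that a service chooses not to publish, such as the hostnames, can simply stay empty.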

Services (yours or affected)

  • Oracle Intervention templates here
  • LCG
  • FTS
  • LFC
  • VOM(R)S (concerns VOs: ALICE, ATLAS, CMS, LHCb, DTEAM, OPS, Sixt, Unosat, Geant4)
  • SAM
  • Gridview
  • Shiva
  • FCR
  • LHCb LFC
  • CMS
    • Phedex
  • ATLAS
    • LCG RBs
    • gLite WMS
    • UIs
    • SE
  • ALICE
    • PDB (Oracle Physics DataBase)
    • VO boxes
    • LCG RBs
    • gLite WMS
    • gLite CEs
    • LCG CEs
    • SE (Castor)

Sites and their status

https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceLog/Weekly.htm

Announcement templates

Find your service in this list and contact us about the standard text.

This script will automatically generate the broadcast information and send it to the right bodies at the right frequency. Please test it at your next service intervention!

Permanent contact list

    • Miguel Anjo - Physics databases
    • Jacek Wojcieszuk - PS DBA team
    • Gavin McCance - FTS support
    • Digambar Sonvane - GridView support
    • Miguel Coelho Dos Santos - LFC support
    • Remi Mollon, Cc: Maria Dimou - VOMS support
    • Lanxin Ma, Cc: Remi Mollon - VOMRS support
    • Piotr Nyczyk - SAM support
    • Zdenek Sekera - Shiva support
    • James Casey - all LCG services
    • Maria Dimou - LHC VO support
    • Daniele.Bonacorsi@cnafNOSPAMPLEASE.infn.it - PhEDEx
    • Yvan Calas - RB support and WMS support
    • Patricia Mendez Lorenzo - ALICE
    • Alasdair Ross or Italo Garr - NETOPS on duty

Detailed life-time of the event

  1. Decide on the intervention Type
  2. Is the service replicated?
  3. Is there an automatic fail-over procedure?
  4. Announce at LCGSCM one month before due date
  5. Prepare/backup/test restore procedures
  6. Prepare the list of affected services
  7. For long interventions, GMOD announces several times and well in advance, particularly when there are external dependencies (services running at other sites):
    • when the date is fixed
    • one week before
    • the day before
    • at the intervention start
    • at the intervention end
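The announcement points for a long intervention can be derived from three inputs: when the date was fixed, the start time and the duration. The sketch below is illustrative only. The function name is hypothetical, and dropping reminders that would fall before the date was fixed is an assumption, not an agreed rule.

```python
from datetime import datetime, timedelta


def broadcast_times(date_fixed, start, duration):
    """Return (label, datetime) pairs for the GMOD broadcasts."""
    candidates = [
        ("date fixed", date_fixed),
        ("one week before", start - timedelta(weeks=1)),
        ("day before", start - timedelta(days=1)),
        ("intervention start", start),
        ("intervention end", start + duration),
    ]
    # Assumption: a reminder that would precede the moment the date
    # was fixed cannot be sent, so it is skipped.
    return [(label, when) for label, when in candidates if when >= date_fixed]
```

For an intervention fixed only three days ahead, this yields four broadcasts instead of five, because the one-week reminder is skipped.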

  • GMOD should translate the internal CERN services affected by the intervention into the list of externally visible grid services (for example, the LCG_SAME service in Oracle should be announced as the SAM monitoring service).
  • For maximum clarity, the announcement should contain only details relevant to grid users: it should name the grid services affected, not internal CERN services. Internal CERN services may be mentioned only if some grid users use them directly (for example, experiments' database services?).
  • All service managers should link from here a simple template for writing GMOD announcements and provide separately the mapping (table?) of internal CERN services (Oracle services, nodes, etc.) to grid services (with an indication of the VOs).
  • The GMOD announcement should be published far enough ahead of the intervention to allow external service managers to make additional announcements if their grid services are affected by the CERN grid services.
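The mapping from internal CERN services to externally visible grid services could be kept as a simple lookup table, as in the sketch below. Only the LCG_SAME entry comes from this page. The table layout, the function name and the empty VO list (to be filled in by the service manager) are assumptions.

```python
# Internal CERN service -> externally visible grid service.
# Only LCG_SAME is taken from this page; the VO list is deliberately
# left empty because the page does not state it.
INTERNAL_TO_GRID = {
    "LCG_SAME": {"grid_service": "SAM monitoring service", "vos": []},
}


def externally_visible(internal_services, mapping=INTERNAL_TO_GRID):
    """Translate internal service names into grid service names.

    Returns (visible, unmapped): the grid services to announce, and the
    internal names with no mapping, which GMOD should query before
    broadcasting.
    """
    visible, unmapped = [], []
    for name in internal_services:
        entry = mapping.get(name)
        if entry is None:
            unmapped.append(name)
        elif entry["grid_service"] not in visible:
            visible.append(entry["grid_service"])
    return visible, unmapped
```

Returning the unmapped names separately keeps internal service names out of the announcement text while still flagging gaps in the table.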

Did we forget anything?

Did we document the Plan B for the specific intervention per service affected?


This page is a result of discussions between Jamie, Maria, Piotr, Vikas and Vinod and the LCGSCM members. It was discussed at the LCGSCM of Wed. March 14th. -- Main.dimou - 19 Feb 2007
Topic attachments

  • LHCbIntervention.txt (3.0 K, 2007-03-21 18:05): LHCb SRM Intervention example
Topic revision: r19 - 2007-05-30 - MariaDimou