LCG Grid Deployment -
LCG Production Services
Definition of GMOD
- The GMOD is a service manager from the GD group; the role changes hands on a weekly rota. S/he has a back-up, who is also a service manager from the GD group.
- The GMOD is supposed to be 'on duty' during working hours.
- The main function of the GMOD is to ensure that problems reported for GD managed machines are properly followed up and solved.
- The GMOD receives all tickets that are sent to the REMEDY mail feeds SERVICE.support@cernNOSPAMPLEASE.ch where SERVICE is one of: rb, wms, ce, lfc, fts, bdii, mon, px (myproxy server), sam, voms, vomrs.
- The GMOD should solve the problem herself/himself if possible; otherwise s/he should ask other service managers for help, or make sure the problem is handed over to the relevant expert and followed up. In addition, the GMOD should ensure that the problem is acknowledged to whoever reported it within a reasonable time.
People acting as GMOD should refer to the Remedy ROC Structure document for "official" details about the workflow, roles and responsibilities defined for the IT Services within the CERN Remedy PRMS.
This page deals with the specific details and duties of the GMOD "job" as done within the GD group.
How to contact the GMOD
- PHONE:
- primary: 164111 (+41764874111)
- backup: 164222 (+41764874222)
The primary phone lies with the GMOD, the backup with the GMOD back up.
GMOD rota
GMOD rota for the GMOD and his/her back up
GMOD meetings
There is a Google calendar of these items that you can use. However, the list here is the authoritative list of events and actions to take. Please ask one of the other GMODs to add you to the list of people who can maintain and view this version.
- Before you go to the CCSR meeting, submit a written report for the minutes.
Responsibilities of the GMOD
The GMOD should proactively handle problems arising during her/his duty time. The main activity is to coordinate problem solutions and to inform people concerned.
- have a look and distribute the Remedy tickets
- you should look under NON FIXED CASES: My Group Assignments
- follow up the status of the services we are responsible for (GD services, July '07)
- represent all services in the weekly meetings:
- Coordinate information sent outside, to the grid, about all the CERN-PROD services
- EGEE broadcast (use the "To LCG Service Challenges responsibles" target for WLCG services)
- Coordinate interventions in the GD services and with FIO (SMOD), e.g.:
- Mw upgrade
- Kernel upgrade
- Announce CERN production service interventions to the 9:00 daily meetings:
- Announce CERN production service interventions to the grid:
- It is the GMOD (it-dep-gd-gmod@cernNOSPAMPLEASE.ch) who is responsible for deciding whether the announcement should be broadcast to 'the Grid' (via the CIC portal) and whether additional info is required (e.g. to clarify for external users).
- Use the EGEE broadcast for this, following the standard templates as defined here
- Renew the host certificates for nodes in production when they are expired: https://twiki.cern.ch/twiki/bin/view/LCG/GDReqHostCert
- Check the weekly CERN-PROD RC report and make sure that every unavailability longer than 2 hours is explained in the following format:
- Problem
- Cause
- Solution
- Most of them are related to site services, which are under FIO responsibility, so our task is to check that all are explained. If some are not, gather all the information available from SAM, correlate it with the downtimes and broadcasts of the week (the GMOD knows these because s/he has been sending them and attending the morning meetings), and send it all to grid-cern-prod-admins@cernNOSPAMPLEASE.ch so that they have the full picture and can fill in the report. This should be done on Friday morning, before 2.00 pm.
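The check on the weekly report can be sketched in Python. This is only an illustration under stated assumptions: the `Unavailability` record and the `missing_explanations` function are hypothetical names, and the way entries are obtained from the RC report or SAM is not covered here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold taken from the text: anything longer than 2 hours must be explained.
THRESHOLD = timedelta(hours=2)


@dataclass
class Unavailability:
    """One unavailability period from the weekly report (hypothetical layout)."""
    service: str
    start: datetime
    end: datetime
    problem: str = ""
    cause: str = ""
    solution: str = ""


def missing_explanations(entries):
    """Return (entry, missing_fields) pairs for long, incompletely explained entries."""
    result = []
    for entry in entries:
        if entry.end - entry.start <= THRESHOLD:
            continue  # 2 hours or less: no explanation required
        missing = [name for name in ("problem", "cause", "solution")
                   if not getattr(entry, name).strip()]
        if missing:
            result.append((entry, missing))
    return result
```

Entries returned by `missing_explanations` are the ones for which the GMOD would collect SAM data and broadcast history before writing to the site administrators.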
CDB/LANDB mapping
The service managers should check the mapping between the CDB and LANDB information for the machines they are responsible for, in particular for the following fields:
| LANDB | CDB | Description |
| Tag | /system/cluster/name | The name of your cluster (e.g. gridvoms). |
| Tag | /system/cluster/subname | The name of your subcluster, if any (e.g. gridwms is a subcluster of cluster grid). |
| Description | /system/cluster/description | The description of your cluster/node (e.g. "gLite WMS (Workload Management System) 3.1"). |
| Main user of the device | /system/cluster/usercontact | The name of the main service manager. |
Note that you can specify the value of the CDB variables at the node level (e.g. by editing template profile_wms101.tpl) or at the cluster level (e.g. by editing template pro_service_gridwms.tpl). Please contact SteveTraylen or Yvan.Calas@cernNOSPAMPLEASE.ch if you have any questions concerning CDB.
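As a rough illustration of the consistency check, the field mapping can be written as a small Python comparison. Everything here is a sketch: `find_mismatches` is a hypothetical helper, both records are assumed to be plain dictionaries already fetched from CDB and LANDB, and the rule that the LANDB Tag should equal the subcluster name when one is set (otherwise the cluster name) is an assumption drawn from the gridwms example.

```python
def find_mismatches(cdb, landb):
    """Compare one machine's CDB values with its LANDB fields.

    Both arguments are plain dicts (hypothetical; fetching them is out of scope).
    Assumption: the LANDB Tag should equal the subcluster name when one is set,
    and the cluster name otherwise.
    Returns {landb_field: (expected_from_cdb, actual_in_landb)} for each mismatch.
    """
    expected = {
        "Tag": cdb.get("/system/cluster/subname") or cdb.get("/system/cluster/name"),
        "Description": cdb.get("/system/cluster/description"),
        "Main user of the device": cdb.get("/system/cluster/usercontact"),
    }
    return {field: (want, landb.get(field))
            for field, want in expected.items()
            if want != landb.get(field)}


cdb = {
    "/system/cluster/name": "grid",
    "/system/cluster/subname": "gridwms",
    "/system/cluster/description": "gLite WMS (Workload Management System) 3.1",
    "/system/cluster/usercontact": "service manager name",
}
landb = {
    "Tag": "gridwms",
    "Description": "out-of-date description",
    "Main user of the device": "service manager name",
}
print(find_mismatches(cdb, landb))
```

An empty result means the three LANDB fields agree with CDB for that machine.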
For the time being, GD is responsible for the following clusters:
| LANDB Tag | Description | Comments |
| sam-bdii | BDIIs for SAM | - |
| sam-dpm | DPM for SAM | - |
| gridfts | FTS nodes | LANDB to update. |
| gdui | GD-only official UI with incoming connectivity | For GD only. |
| gridlb | gLite LB (Logging and Bookkeeping) 3.1 | New cluster. |
| gridwms | gLite WMS (Workload Management System) 3.1 | New cluster. |
| gridrb | gLite WMS 3.0 and 3.1 | All the nodes belonging to this cluster will be moved to cluster gridwms in July 2007. |
| lcgrb | LCG RB (Resource Broker) | - |
| sam-mon | SAM clients and servers | - |
| gridvoms | VOMS nodes | - |
For example, if you want the list of all the machines belonging to a given cluster (lcgrb, say), go to the netops web page and fill the field "Tag" with the string "lcgrb".
There is also a wiki page related to the actual status of the WMS, LB and RB nodes here.
Finding Information about Clusters and Nodes
See GModClusterNodeQueries.
GMOD Reports/Presentations
Useful links
Useful e-mails and mailing lists
How to use Remedy
You can use any of the following, but the Windows client is recommended:
- the Remedy web interface
- the Remedy client, available for any Windows PC/laptop
- the Remedy client available on the Windows Terminal Service via remote desktop (if you do not have the GUI standard on CERN SL type "rdesktop cernts -a 15 -g 1280x1024" in a terminal window)
- the mail feed to Remedy, i.e. by sending email to arsystem@sunar01NOSPAMPLEASE.cern.ch with special keywords in the message Subject. Instructions here.
It may happen that a ticket previously assigned to the GMOD or to a service manager in GD has to be re-routed to people in FIO. In such a case, FIO service managers have explicitly asked us not to assign tickets to the relevant expert Remedy sub-category (!) but to leave them in "General", because they want all their tickets to be processed by the SMOD.
The CERN ROC has set up a page with Remedy Tip and Tricks, where GMODs may find useful hints on using some advanced features effectively (e.g. interaction with GGUS, advanced searches).
Some useful information (connection) can also be found in a FAQ page, which is addressed more specifically to the Remedy-GGUS interface and is therefore not directly in the scope of the GMOD.
The Remedy CERN homepage is http://service-it-remedy.web.cern.ch/service-it-remedy
List of GD service managers and service experts
http://egee-docs.web.cern.ch/egee-docs/ROC_CERN\gd-service-mgrs-experts.htm
Instructions for EGEE broadcasts
Remember to communicate the information concerning CERN production services to the SMOD (it-dep-fio-smod@cernNOSPAMPLEASE.ch) and to the MOD (mod@cernNOSPAMPLEASE.ch) to ensure that they are also aware.
Guidelines to send broadcasts:
- Use the EGEE broadcast for this, following the standard templates as defined here
- Follow WLCG procedures as specified in Scheduling of Service Interruptions at WLCG Sites, mainly regarding:
- Timelines for announcements
- Announcement for some cases to the operations meeting through the site reports
- Use UTC time (or local + UTC)
- Write it from the user point of view, mentioning the way the service will be affected:
- e.g. "FTS service will be down" rather than "LGCR rack down" or "DNS service not available"
- List affected grid production services and VOs
- Put a meaningful title, starting with the official site name related to the intervention, e.g. CERN-PROD:
- Short and concise messages are preferred
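For the "local + UTC" guideline, a small Python helper can produce both times in one string. This is a minimal sketch: `local_plus_utc` is a hypothetical name, the date is an example only, and the caller is assumed to supply the correct local offset (1 for CET in winter, 2 for CEST in summer) rather than relying on a time zone database.

```python
from datetime import datetime, timedelta, timezone


def local_plus_utc(local_naive, utc_offset_hours):
    """Format a local wall-clock time together with its UTC equivalent.

    utc_offset_hours is the local offset from UTC, supplied by the caller
    (e.g. 1 for CET in winter, 2 for CEST in summer).
    """
    local = local_naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))
    utc = local.astimezone(timezone.utc)
    return f"{local:%Y-%m-%d %H:%M} local / {utc:%Y-%m-%d %H:%M} UTC"


print(local_plus_utc(datetime(2007, 2, 22, 9, 0), 1))
# prints: 2007-02-22 09:00 local / 2007-02-22 08:00 UTC
```

Both dates are printed in full so that an intervention crossing midnight is not misread.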
Selection of the recipients:
- Always set "News publication in all CIC portal views" to yes
- If it ONLY affects T1s (and no other sites): To WLCG Tier-1 contacts
- If it affects the COD activity: To CIC-on-duty (CIC-on-duty mailing list)
- Include always: ROC Managers (ALL ROC Managers by default)
- Include always: Affected VO managers; if this is not known, all VO managers (by default)
- Affected VO users: only when affected. Do not SPAM VO users mailing lists!
- If it affects all sites or a subset of T1s/T2s: Production Site Admin (All by default)
- If it affects the PPS service: PPS Site Admin (All by default)
- Examples:
- SAM will be down: it affects the COD, all production sites, PPS, all VO managers (SAM is also used by the VOs), ROC managers
- VOMS intervention: Affected VO managers, affected VO users, ROC managers
- Castor intervention: WLCG Tier-1 contacts, ROC managers, VO managers, VO users
- FTS intervention: WLCG Tier-1 contacts, ROC managers, VO managers, VO users
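The recipient rules above can be summarised as a small decision function. This is an illustrative sketch only: `select_recipients` and its parameter names are hypothetical, the returned strings are labels for the GMOD to tick on the broadcast form, and nothing here talks to the CIC portal.

```python
def select_recipients(only_tier1=False, affects_cod=False, affects_sites=False,
                      affects_pps=False, affected_vos=None, notify_vo_users=False):
    """Return the recipient groups to select for a broadcast.

    affected_vos: list of VO names, or None when the affected VOs are not known.
    notify_vo_users: set only when VO users are really affected (avoid spamming).
    """
    recipients = ["ROC Managers"]  # always included
    if affected_vos:
        recipients.append("VO managers: " + ", ".join(affected_vos))
    else:
        recipients.append("VO managers: all")  # default when the VOs are unknown
    if notify_vo_users and affected_vos:
        recipients.append("VO users: " + ", ".join(affected_vos))
    if only_tier1:
        recipients.append("WLCG Tier-1 contacts")
    if affects_cod:
        recipients.append("CIC-on-duty")
    if affects_sites:
        recipients.append("Production Site Admin")
    if affects_pps:
        recipients.append("PPS Site Admin")
    return recipients


# The "SAM will be down" example from the list above.
print(select_recipients(affects_cod=True, affects_sites=True, affects_pps=True))
```

"News publication in all CIC portal views" stays set to yes in every case, so it is not modelled as a parameter.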
Example of a good broadcast text:
Dear WLCG users,
On Thursday, February 22 from 8:00 am until 11:00 am UTC we are planning an
intervention on our Oracle cluster.
During that time the following Grid services will be down at CERN:
* FTS
* LFC
* VOMS/VOMRS (ALICE, ATLAS, CMS, LHCb, DTEAM, OPS, Sixt, Unosat, Geant4)
* SAM and GridView
* FCR
The intervention will take 3 hours and should be finished by 11:00 am UTC.
Thank you for your understanding.
Grid Manager on Duty at CERN
VOMS Service interruption announcement template for GMOD use
Publish on the CIC portal with the following options:
News on cic.gridops.org: YES
Email to:
ROC managers,
VO managers of ALICE, ATLAS, CMS, LHCb, DTEAM, Geant4 and OPS *only!!*,
VO users of ALICE, ATLAS, CMS *only!!*,
Production and PPS Site Admins **only if gridmap file generation is affected !!**
Add in copy on the CIC portal OSG contacts
goc@opensciencegrid.org and rquick@iu.edu **NB!! There is no such button on the broadcast form!!**
Title: DATE TIME TIMEZONE scheduled interruption of the CERN vomrs and voms services
Text:
All voms and vomrs services (registration, gridmap file update and proxies) will not be accessible
during DATE TIME TIMEZONE. Reason: TYPE THE REASON HERE.
This applies to VOname = ALICE, ATLAS, CMS, LHCb, DTEAM, OPS, Sixt, Unosat, Geant4
Please contact project-lcg-vo-dteam-admin@cern.ch in case of problem.
Thank you for your understanding.
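Filling in the placeholders of the template above can be sketched in Python. This is a convenience illustration, not an official tool: `render_voms_announcement` is a hypothetical helper, the wording is copied from the template, and the result still has to be pasted into the broadcast form by hand.

```python
DEFAULT_VOS = "ALICE, ATLAS, CMS, LHCb, DTEAM, OPS, Sixt, Unosat, Geant4"

TITLE = "{when} scheduled interruption of the CERN vomrs and voms services"
TEXT = (
    "All voms and vomrs services (registration, gridmap file update and proxies) "
    "will not be accessible\n"
    "during {when}. Reason: {reason}.\n"
    "This applies to VOname = {vos}\n"
    "Please contact project-lcg-vo-dteam-admin@cern.ch in case of problem.\n"
    "Thank you for your understanding."
)


def render_voms_announcement(when, reason, vos=DEFAULT_VOS):
    """Return (title, text) with DATE TIME TIMEZONE and the reason filled in."""
    if not when.strip() or not reason.strip():
        raise ValueError("both the time window and the reason must be filled in")
    fields = {
        "when": when.strip(),
        "reason": reason.strip().rstrip("."),  # the template adds the final full stop
        "vos": vos,
    }
    return TITLE.format(**fields), TEXT.format(**fields)


title, text = render_voms_announcement(
    "2007-02-22 08:00-11:00 UTC", "intervention on the Oracle cluster"
)
print(title)
print(text)
```

Refusing empty values guards against sending a broadcast that still contains a blank placeholder.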
-- Main.diana - 09 Oct 2006