User Support Advisory Group (USAG)

Work started at the March 6th 2008 ESC meeting on a smooth transition from the EGEE-II ESC to a new body in EGEE-III, the User Support Advisory Group. Chaired and organised by the Operations Coordination Centre (OCC), this group will be composed of the VO managers (or their representatives) and representatives from other activities using GGUS. Its role is to advise GGUS on development directions both for the tools and the processes. See the Description of Work (DoW) document extract, around page 101 in the PDF version.

Mandate

The USAG mandate is to:

  1. Examine requirements from all relevant parties - VOs, ROCs and Sites, identify common points and differences and see how they influence the Grid Support processes and tools.
  2. Consolidate all requirements taking into consideration the needs and operational procedures of ROCs and sites.
  3. Advise on the consequent evolution of the Global Grid User Support (GGUS), which is the core system of the Grid support effort.
  4. Report on the development, testing, and deployment plan for new GGUS features compared to the recommended evolution.
  5. Make known to the appropriate forum - VOs, ROCs, sites and all other SUs the suggested GGUS system evolution and the procedures that need to be updated accordingly.
  6. Define the expectations from all Support Units (SUs) via Operational Level Agreements (OLAs), get acceptance by the SUs and leave OLA enforcement to the management partners involved.

Meeting frequency: Monthly

Participation: All ROCs (representatives should be authorised to comment on their ROC's TPM commitments), the GGUS developers, OCC (member chairing USAG: M.Dimou), NA4, VOs, Sites (in agreement with their ROC).

Mandate, membership and meeting frequency were discussed at the 10/4/2008 USAG meeting and submitted for approval to the OCC and ROC managers at their 15/4/2008 meeting.

Meetings' List (Notes linked from each agenda)

Wrap-up

The last EGEE USAG meeting took place on 2010/04/21. Its agenda and minutes, which cover the group's succession, are linked from here.

Projects

Direct routing to sites FAQ

  • Q: What exactly is 'Direct routing to ALL sites for ALL GGUS tickets'? A: I open a GGUS ticket via the web form https://gus.fzk.de/pages/ticket.php?gotopg=home selecting the site in the 'Notify Site' drop-down menu. The ticket is automatically assigned to the relevant ROC, it doesn't pass through the TPM, and the remote Site is notified by email via the mailing list published in GOCDB. For US sites, the site name and contact email address are taken from OIM.
  • Q: Are TEAM tickets also routed directly to ALL sites? A: Yes, as of Feb 2009. savannah #106859. Between July 2008 and Feb 2009 TEAM tickets were routed directly only to Tier1 sites, with contact info taken from https://twiki.cern.ch/twiki/bin/view/LCG/TierOneContactDetails
  • Q: Is direct routing already available in production or only in the test system? A: It was made available in production for normal tickets in the Jan 2009 GGUS Release.
  • Q: Where from and how often is the site info taken? A: Site names and contact email addresses are taken automatically via web services once per week on Monday nights from GOCDB (for EGEE) and OIM (for OSG) sites. The changes are marked.
  • Q: Are these really ALL sites? A: 'Test' and 'suspended' sites are not included. 'uncertified' sites are included as decided at the 2008-11-27 USAG.
  • Q: Do we still need TEAM tickets now that any normal user can submit a GGUS ticket notifying the site directly? A: The remaining added value of TEAM tickets is the co-ownership of the ticket by all TEAMers (LHC experiment shifters), which allows them all to update the same ticket.
  • Q: Can we now send ALARMS to ALL sites too? A: This functionality won't expand to ALARM tickets because only LCG Tier1s are bound to such response times.
  • Q: Do you have more complete documentation on TEAM and ALARM tickets' flow? A: Yes, please have a look in the relevant links on https://gus.fzk.de/pages/docu.php
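
The routing rules described in this FAQ can be summarised in a short sketch. It is illustrative only and not the GGUS implementation: `Site`, `Ticket`, `route` and all field names are hypothetical, and the status and tier values simply mirror the answers above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical model for illustration; none of these names exist in GGUS.
EXCLUDED_STATUSES = {"test", "suspended"}  # 'uncertified' sites remain selectable


@dataclass
class Site:
    name: str
    support_unit: str                  # relevant ROC, or 'OSG' for US sites
    contact_email: str                 # taken from GOCDB (EGEE) or OIM (OSG)
    status: str                        # e.g. "certified", "uncertified", "test"
    tier: int                          # 0 or 1 for Tier0/Tier1, larger otherwise
    alarm_email: Optional[str] = None  # site alarm list, Tier0/Tier1 only


@dataclass
class Ticket:
    kind: str                          # "normal", "team" or "alarm"
    notify_site: Optional[Site] = None


def route(ticket: Ticket) -> dict:
    """Return who gets the ticket and whether the TPM is bypassed."""
    site = ticket.notify_site
    if ticket.kind == "alarm" and (site is None or site.tier > 1):
        raise ValueError("ALARM tickets must target a Tier0/Tier1 site")
    if site is None:
        # No 'Notify Site' selected: the ticket follows the usual TPM triage.
        return {"assigned_to": "TPM", "bypass_tpm": False, "notify": None}
    if site.status.lower() in EXCLUDED_STATUSES:
        raise ValueError(f"{site.name} is not offered in the 'Notify Site' menu")
    notify = site.alarm_email if ticket.kind == "alarm" else site.contact_email
    return {"assigned_to": site.support_unit, "bypass_tpm": True, "notify": notify}
```

For example, `route(Ticket("normal", some_tier2_site))` returns the site's support unit and contact address with `bypass_tpm` set, while a ticket without a selected site stays with the TPM.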

Proposal for special GGUS routing based on specific ticket attributes [Released on 3 July 2008]

Executive summary

  1. Grid partners, especially VOs, require a direct way to report urgent problems that must be solved within hours to the service experts/site responsibles.
  2. To do this, a change of today's GGUS routing is needed, which, for certain cases, with specific criteria, will not involve the TPMs.
  3. As the change is major, explicit agreement on authorised submitters, workflow and designated supporters is absolutely necessary. The specifications:
    1. Ordinary tickets are:
      • Submitted by a member of "The Team", i.e. a group of experienced VO users. The submitter selects the specific site to notify from the menu.
      • Automatically assigned to the relevant ROC at submission time AND copied to the official contact email of the site, taken from GOCDB.
      • Viewable by all as usual. Updates are only possible for, and notifications only go to, "The Team" and the supporters involved.
    2. Alarm/Emergency tickets are:
      • Submitted by an Authorised "Alarmer", i.e. one of the 3-4 Grid experts in the VO.
      • GGUS, recognising the Authorised "Alarmer" signature from the browser or the signed email at submission time, will assign the ticket to the ROC, with the "affects specific site" flag set, automatically copying the Site's alarms' mailing list, taken from GOCDB.
  4. We propose to do this work in GGUS, because opening GGUS tickets, in addition to or instead of mailing lists, hypernews, savannah tickets, broadcasts, Elogs and meeting notes, gives the following advantages:
    1. persistent URIs to be quoted and linked, when needed.
    2. possible ticket attributes' expansion, i.e. new priority/routing criteria and/or more SUs.
    3. existing (and also expandable) escalation reports that show stalled tickets needing attention.
    4. automatic ticket creation via email by users or mailing lists.
    5. direct link to related savannah tickets on middleware bugs or any other related issue on the web.
    6. possibility to turn a ticket solution into a FAQ, without effort by the supporter.
    7. availability of a Knowledge Base of solved tickets for consultation.
    8. although GGUS is not an alarm system, a work item is now open for including the required information in the GGUS tickets.

Work items

  1. Implement the notion of "The Team", i.e. a group of VO experts/managers/shifters who have update and notification rights on a GGUS ticket equal to the submitter.
  2. Implement automatic email copy to the site's contact email. This implementation option covers Tier1s and Tier2s.
  3. Implement Protected Ticket Types (PT2), for use by tickets submitted by "The Team" and/or Authorised "Alarmers".
  4. Understand who needs an expansion of the, partially already existing, Service Ticket Types (ST2), i.e. action-triggering and not problem-reporting tickets and how to implement them.
  5. Maintain "Team" and "Alarmers" members defined in simba mailing lists via automatic membership extraction.
  6. If helpdesk@ggusNOSPAMPLEASE.org is a member of another mailing list, GGUS ticket opening is automatic (point 4.4 in the Summary above), but every new mail message that does not refer to an existing GGUS ticket generates a new ticket. This is a feature of the GGUS mail parser, which was explicitly agreed on by the USAG predecessor (ESC) before it was implemented. An attempt to revisit this was withdrawn by the requestors. GGUS is not a competitor to alarm lists. It is just easier to trace what happened with a given problem via a URL than via a mail thread. Should we manually open GGUS tickets for problems reported in Tier1 "alarm" lists for this reason?
  7. Understand better the on-going work on 'alarms'.
  8. Draw technical conclusions from the Results of the Site Survey Questionnaire. According to the plan the results are due on April 18th 2008.

Specification collection

Processes/procedures not defined/dependent on GGUS

  1. OLAs, SLA, SLDs or MoUs should be formally agreed and actually respected (monitored?). Existing documents so far:
    1. Service Level Description (SLD) between ROCs and Sites.
    2. Operational Level Agreement (OLA) between GGUS and the TPMs. GGUS tried to take SLD metrics into account in the escalation reports but the ROCs (Tier1s) didn't agree.
  2. There is, often, a new tool/procedure decided in some meeting every few months, but we don't always know about it. In a limited resources' environment, it is good to profit from an existing "because it's there" infrastructure that already links all Grid partners, simply adding missing functionality. Recent fashion items:
    • MoU in GGUS.
    • Fabric/GGUS interface.
  3. Local ticketing systems at the ROCs/Sites (Tiers) match the decentralisation spirit of EGEE III and, in the future, EGI. Nevertheless, they introduce a big testing and service verification effort when GGUS expands and the interfaces must be adapted. If there are now about 36 National Grid Initiatives (NGIs) and most of them plan to have their own ticketing system, major GGUS releases will take months of porting effort.

Advantages summarised

  1. Proposed features reduce routing time AND human efforts/costs, in line with EGEE-III goals.
  2. Use of an existing tool (GGUS) leverages development and maintenance efforts.
  3. Continues to give all Grid users (VOs and Sites) a single, familiar interface for problem resolution.

Other VO-requested GGUS work-items

  1. Implement OPN ticket handling. Understand whether this is a special PT2 case.
  2. Improve GGUS browsability.
  3. Add MoU-related fields on the GGUS ticket submission form and the GGUS search engine. Statistics should be produced by the Operations Automation Tools (OAT) team and not via the GGUS escalation reports.
  4. Provide RSS Feeds in addition to email notifications.
  5. Generate GGUS tickets out of fabric monitoring tools, e.g. Nagios.

Periodic ALARM ticket testing rules

There are 2 kinds of tests:

  1. Routine GGUS integrity monthly test, initiated by the GGUS developers on the day of a GGUS release. Its purpose is to ensure that the new GGUS features didn't break the ALARM ticket workflow.
  2. Periodic test (frequency decided by WLCG), initiated by the Authorised ALARMers in the VOs, checking the full chain, including taking action on tested critical services per ALARM type. This is the list of agreed WLCG Critical Services https://twiki.cern.ch/twiki/bin/view/LCG/WLCGCriticalServices
NB! Information on ALARM handling by the T0/T1s collected in autumn 2010 is in https://savannah.cern.ch/file/AlarmMailHandling.pdf?file_id=16222

Routine GGUS integrity monthly test details:

Tier0 and Tier1s please remember:

  • A test ALARM GGUS ticket will be initiated by the GGUS developers (who are also authorised ALARMers) on the day of every GGUS release, tracked in savannah with one ticket per GGUS release.
  • The alarm notification email will be signed by the GGUS certificate as always. It is the sites' decision whether this notification email becomes an SMS.
  • The operators, when receiving the GGUS email notification, should reply to the message. This will put an entry in the GGUS ticket's Public diary, confirming the successful delivery of the ALARM notification. Then the operators should contact the relevant service managers, as if the ALARM were real. Ticket closing should be done by the service managers, acknowledging receipt.
  • The test tickets are sent in 3 slices per convenient timezone:
    • Asia/Pacific right after the release,
    • European sites early afternoon (~12:00 UTC),
    • US sites and Canada late afternoon (~18:00 UTC).

Periodic GGUS ALARM ticket test - full chain details:

  • Timestamps:

  1. Fix the testing week at the WLCG daily Operations meeting. It should be the week preceding a GDB, 3 times per year (list of GDB dates).
  2. Remind VOs at [Mar & Jul & Nov]GDB - 2 weeks (Maria D. during the WLCG daily meeting).
  3. VOs send ALARM emails or submit ALARM tickets during [Mar & Jul & Nov]GDB - 1 week.
  4. VOs submit conclusions no later than Friday of the testing week.
  5. wlcg-scod to circulate conclusions on Monday of the GDB week.
  6. wlcg-scod to present conclusions to the f2f PMB on Tuesday of the GDB week.

  • Procedure:

  1. Inform Tier1 sites at the WLCG daily meeting that there will be an ALARM ticket testing during [Mar & Jul & Nov]GDB - 1 week (day time) without precision on the time (like a fire alarm).
  2. The VOs will launch the ALARMs during working hours of the target site. Clearly write TEST in the ticket subject (so as not to wake people up, especially at sites where an alarm email becomes a brief SMS). The ticket description should require the site to follow the procedure foreseen if this were a true alarm (e.g. your VObox/CE doesn't work at your site). The site will have to close the GGUS ticket confirming they understand what they would have done, had this been a true alarm.
  3. Tier0 specific: When testing ALARMs with the Tier0, please provide the following information to make sure the operators will handle the email appropriately for the complete chain to be tested. If this information is missing, the operator will ask for it - please answer, otherwise the test would be incomplete.
    1. specify the affected service: one of FTS, SRM, LFS
    2. include the line This is a request for a data operations piquet call
    3. provide a "contact back" phone number
  4. If a site is on scheduled downtime during the whole of the ALARMS testing week, the site should be omitted from this round of tests.
  5. Use ALARM examples provided by the LHC experiment VOs.
  6. Report results at the WLCG daily Operations meeting.
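
The Timestamps list above amounts to simple date arithmetic around the GDB date. The sketch below is one possible reading, not an official tool: the function name and dictionary keys are hypothetical, and it assumes that "GDB - 2 weeks" and "GDB - 1 week" are counted from the Monday of the GDB week.

```python
from datetime import date, timedelta


def alarm_test_schedule(gdb_day: date) -> dict:
    """Derive the periodic ALARM test timestamps from a GDB meeting date."""
    # Monday of the week in which the GDB takes place.
    gdb_monday = gdb_day - timedelta(days=gdb_day.weekday())
    # The testing week is the week preceding the GDB week.
    test_monday = gdb_monday - timedelta(weeks=1)
    return {
        "remind_vos": gdb_monday - timedelta(weeks=2),
        "test_week_start": test_monday,
        "conclusions_due": test_monday + timedelta(days=4),  # Friday of testing week
        "circulate_conclusions": gdb_monday,                 # Monday of GDB week
        "present_to_pmb": gdb_monday + timedelta(days=1),    # Tuesday of GDB week
    }


schedule = alarm_test_schedule(date(2009, 10, 14))
print(schedule["remind_vos"], schedule["test_week_start"])
# prints: 2009-09-28 2009-10-05
```

With the GDB of Oct 14th 2009 this reproduces the reminder date (Sep 28th) and testing week (Oct 5th) recorded for the third round of 2009 in the history below.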

History of Periodic GGUS ALARM tests, full chain:

First round in 2010: Maria reminds the VOs on May 27th. The tests will run in the week of May 31st, the GDB being planned for Jun 9th. Related ticket savannah #114705.

Third round in 2009: Maria to remind the VOs on Sep 28th. The tests will run in the week of Oct 5th, the GDB being planned for Oct 14th. Related ticket savannah #109566.

Second round in 2009: Maria to remind the VOs on Mar 23rd. The tests will run in the week of Mar 30th, the GDB being planned for April 8th. Related ticket savannah #107452.

First round in 2009: Maria to remind the VOs on Feb 23rd. The tests will run in the week of March 2nd, the GDB being planned for March 11th. Related ticket savannah #105104.

How to register your TEAM and ALARM members in VOMS

VO Admins of the LHC Experiment VOs: go to https://lcg-voms.cern.ch:8443/vo/YourVOname/vomrs, create Groups (recommended) called "team" and "alarm" (case is important) and add the DNs of your authorised teamers and alarmers. The TEAMers and ALARMers will not need to obtain a VOMS proxy in order to submit a relevant GGUS ticket. GGUS will simply consult VOMS daily instead of a static list, as was used during the period July 2008 to June 2009. If you can't remember who is who or if you wish to use VOMS Roles instead, please contact Guenter.Grein@iwrNOSPAMPLEASE.fzk.de for the Teamers' list and read the Alarmers' list in the Alarms' twiki. Relevant savannah ticket #104835.

GGUS-to-OSG routing (July 2009 snapshot)

Status of 2009-07-01

  1. 'Usual' tickets, i.e. the submitter does not select any site on the web submission form or submits via email to helpdesk@ggusNOSPAMPLEASE.org: The TPM manually assigns the ticket to the Support Unit 'OSG', which sends email to ggus@tickNOSPAMPLEASE.globalnoc.iu.edu. This is the right way to do things. TPMs who know that some American sites are still under ROC CERN may instead assign to the Support Unit 'ROC CERN', which sends email to roc-cern.support@cernNOSPAMPLEASE.ch and opens a CERN PRMS ticket. The ROC CERN staff close the PRMS ticket and give the GGUS ticket to OSG. This is not the right way; hopefully it hardly ever happens.
  2. Alarm tickets, i.e. only submittable by 4-5 (maximum) LHC experiment members, 'elected' as authorised alarmers, known to GGUS by their personal certificate DN. Submission is possible via a special GGUS web form or via a special email template to helpdesk@ggusNOSPAMPLEASE.org. The TPM never sees these tickets. They are automatically assigned to Support Unit 'OSG', which sends email to ggus@tickNOSPAMPLEASE.globalnoc.iu.edu. In addition, a ggus-signed email goes to the Emergency email addresses listed below. The submitters can only select these USA sites when submitting:

     Site name                            Emergency email
     US-FNAL-CMS (name agreed by wLCG)    cms-t1-page@fnalNOSPAMPLEASE.gov
     US-T1-BNL (name agreed by wLCG)      bnl-alarms-l@listsNOSPAMPLEASE.bnl.gov

  3. a. Team tickets, i.e. only submittable by LHC experiment shifters, known to GGUS by their personal certificate DN (via a VOMS Group/Role), AND
     b. tickets with a specific Site selected from the Notify Site field.
     Submission is possible via a special GGUS web form or via a special email template to helpdesk@ggusNOSPAMPLEASE.org for teamers (3.a.) and via the default submission form for other users (3.b.). The TPM never sees these tickets. They are automatically assigned to Support Unit 'OSG', which sends email to ggus@tickNOSPAMPLEASE.globalnoc.iu.edu. In addition, a ggus-signed email goes to the Contact email addresses listed below. The submitters can select these USA sites when submitting:

NB!! Situation changed since July 2009.

Site name                              Contact email
USCMS_FNAL_WC1 (taken from GOCdb)      cms-t1@fnal.gov (taken from GOCdb today)
USCMS_FNAL_WC1-CE (taken from OIM)     no address available today. must be provided in OIM
USCMS_FNAL_WC1-CE2 (taken from OIM)    no address available today. must be provided in OIM
USCMS_FNAL_WC1-CE3 (taken from OIM)    no address available today. must be provided in OIM
USCMS_FNAL_WC1-CE4 (taken from OIM)    no address available today. must be provided in OIM
USCMS_FNAL_WC1-SE (taken from OIM)     no address available today. must be provided in OIM
USCMS_FNAL-XEN (taken from OIM)        no address available today. must be provided in OIM
--------------                         ----------------------
BNL-LCG2 (taken from GOCdb)            grid@rcfNOSPAMPLEASE.rhic.bnl.gov
BNL-ATLAS-1 (taken from OIM)           no address available today. must be provided in OIM
BNL-ATLAS-2 (taken from OIM)           no address available today. must be provided in OIM
BNL-ATLAS-SE (taken from OIM)          no address available today. must be provided in OIM

  • There is no response time restriction for categories 3.a. and 3.b.
  • Sites with no addresses today simply receive no email today. The automatic assignment to Support Unit 'OSG' is done.
  • Compromise solution for now: GGUS to use the same address, cms-t1@fnalNOSPAMPLEASE.gov for FNAL and grid@rcfNOSPAMPLEASE.rhic.bnl.gov for BNL, for every relevant entry retrieved from OIM.
  • We have no Tier2 information today from OIM. ATLAS needs this.
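
The compromise solution above can be pictured as a fallback lookup. This is only a sketch under stated assumptions: the function `contact_email`, the prefix table and the idea of matching OIM resource names by prefix are hypothetical, and the addresses are copied in their obfuscated form from this page.

```python
from typing import Optional

# Facility-wide fallback addresses from the compromise solution
# (kept in the obfuscated form used on this page).
FALLBACK_BY_PREFIX = {
    "USCMS_FNAL": "cms-t1@fnalNOSPAMPLEASE.gov",
    "BNL": "grid@rcfNOSPAMPLEASE.rhic.bnl.gov",
}


def contact_email(resource_name: str, oim_email: Optional[str]) -> Optional[str]:
    """Pick the address to notify for a resource retrieved from OIM."""
    if oim_email:
        # An address published in OIM always wins.
        return oim_email
    for prefix, fallback in FALLBACK_BY_PREFIX.items():
        # Assumption: resources of one facility share a name prefix.
        if resource_name.startswith(prefix):
            return fallback
    # No address known: the site receives no email, only the
    # automatic assignment to Support Unit 'OSG' takes place.
    return None
```

For instance, `contact_email("BNL-ATLAS-SE", None)` falls back to the BNL address, while a resource with an unknown prefix and no OIM address yields `None`.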

!!NB!! Situation changed after the LCG MB of 20090707. See minutes here. Related ticket: https://savannah.cern.ch/support/?107531

BNL_ATLAS names

Situation on 20091118: On the GGUS ticket submission form and the GGUS search engine (you need a valid certificate to open these forms), as well as on the TEAM ticket submission form (access restricted to TEAMers only), the "Notify SITE" menu shows 4 names for BNL, taken nightly from OIM. This is confusing for the ticket submitters. In the view used by GGUS, OIM should present only the 'resource group' name BNL_ATLAS with 2 valid email addresses, one for 'site contact' and one for 'site alarms'. More details in https://savannah.cern.ch/support/?109779

20091118 OSG-GGUS tel. meeting on BNL-ATLAS naming in GGUS

Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=74529

Participants: Ruth Pordes, Michael Ernst, Burt Hotzman, Rob Quick, Maria Dimou

Apologies: Guenter Grein

Absent: John Hover, Jamie Shiers

Conclusion: The BNL_ATLAS facility managers request OSG developers to equip the Resource Group in OIM with valid 'Contact email' and 'Alarm email' values.

Actions:

  1. Rob to check with Arvind and the other OSG developers on the timescale for this. Progress should be monitored via https://savannah.cern.ch/support/?109779
  2. Burt to check with USCMS which values they want for the US-FNAL-CMS Resource Group.
  3. GGUS developers to use the Resource Group instead of the multiple individual resources once action 1 is done.

Reason for this meeting: All GGUS views containing BNL names today take the values of the OIM individual resources (4 options for BNL, 7 options for FNAL). This is confusing for the experiment members opening GGUS tickets.

Material:

  1. Now: Formal tracking in savannah: https://savannah.cern.ch/support/?109779#comment1 and more recent comments.
  2. July 2009: When we agreed to make 'Contact email' available in OIM for GGUS: wLCG MB presentation on Notifying OSG Resources AND simultaneously assigning automatically GGUS tickets to OSG: https://twiki.cern.ch/twiki/pub/LCG/MbMeetingsMinutes/LCG_Management_Board_2009_07_07.htm
  3. July 2009: OSG-GGUS tel. meeting to prepare the above MB: http://indico.cern.ch/conferenceDisplay.py?confId=62962
  4. March 2009: OSG-GGUS tel. meeting to prepare the move from GOCDB into OIM: http://indico.cern.ch/conferenceDisplay.py?confId=54492
  5. December 2008: OSG-GGUS tel. meeting to understand wLCG request for Direct Site notification: http://indico.cern.ch/conferenceDisplay.py?confId=46350

Site scheduled interventions

Decided at the SA1 coord. meeting @ EGEE'09. Documented in SA1 coord. meeting on 20091006:

  1. When a planned service intervention impacts end-users, it should be declared.
  2. In case of doubt, e.g. a Storage Element will be in downtime and the site doesn't know whether the VO has replicated the data elsewhere, it should be declared.
  3. When a site admin declares the site intervention in GOCDB, an announcement is issued immediately.
  4. The announcement is repeated one day before the intervention.
  5. The announcement is repeated one hour before the intervention.
It is up to the 'recipients' to decide whether they obtain the information via email notification, RSS feeds, calendar entries or cic portal consultation. Every notification method delivers the announcement 3 times, i.e. at declaration time, one day before and one hour before the intervention. Exception: when the declaration is made less than 48hrs before the intervention, no one-day-before announcement is issued, as the two would be too close.
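
The announcement timing rules above can be sketched as follows. This is an illustration rather than the actual notification code: the function name is hypothetical, and skipping the one-hour reminder when the intervention is declared less than one hour ahead is an added assumption that the rules do not state.

```python
from datetime import datetime, timedelta
from typing import List


def announcement_times(declared_at: datetime, starts_at: datetime) -> List[datetime]:
    """List the moments at which an intervention is announced."""
    times = [declared_at]  # announced immediately at declaration time
    # The one-day-before repeat is dropped when the declaration is made
    # less than 48 hours ahead, because the two would be too close.
    if starts_at - declared_at >= timedelta(hours=48):
        times.append(starts_at - timedelta(days=1))
    one_hour_before = starts_at - timedelta(hours=1)
    # Assumption (not in the rules above): no one-hour reminder if the
    # intervention was declared less than one hour in advance.
    if one_hour_before > declared_at:
        times.append(one_hour_before)
    return times
```

An intervention declared four days ahead yields three announcements; one declared 24 hours ahead yields two, the immediate one and the one-hour reminder.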

VO-related stuff

Mostly LHC VOs but also procedures for newly registered VOs in https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport

GGUS ticket assignment automation in EGEE III

  1. ALARM and TEAM tickets automatically assigned to the relevant ROC IF the submitter selects a site (which is a wLCG Tier0 or Tier1) for LHC experiment VOs. In operation since 2008-07-03.
  2. All tickets by any submitter automatically assigned to the relevant ROC IF the submitter selects a site on the ticket submission form. In operation since 2009-01-21.
  3. Ticket submitter is able to directly assign to a ROC on the ticket submission form. In operation since 2009-12-09.
  4. Ticket submitter is able to directly assign to a VO (applies to few VOs): In operation since GGUS early days but being reviewed now via https://savannah.cern.ch/support/?111481
  5. Ticket monitoring via the ggus-info@cernNOSPAMPLEASE.ch members leads to ticket re-assignment to relevant Support Units directly. In operation since GGUS early days but especially since the creation of GGUS escalation reports in EGEE III and the implementation of ticket escalation by the submitter facility on 2009-01-21.

These improvements roughly halved the ticket triage needed by the TPMs (from 80-110 tickets/week in 2008 to 35-60 tickets/week in early 2010).
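
As a quick check of the quoted reduction, the weekly ranges can be compared at both ends and at their midpoints. The helper name `relative_drop` is hypothetical; only the four ticket counts come from the text above.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


before_2008 = (80, 110)  # tickets/week triaged by the TPMs in 2008
after_2010 = (35, 60)    # tickets/week in early 2010

low_end = relative_drop(before_2008[0], after_2010[0])    # 80 -> 35
high_end = relative_drop(before_2008[1], after_2010[1])   # 110 -> 60
midpoints = relative_drop(sum(before_2008) / 2, sum(after_2010) / 2)  # 95 -> 47.5

print(f"{low_end:.0%} {high_end:.0%} {midpoints:.0%}")
# prints: 56% 45% 50%
```

The midpoints give the 50% figure, with the two ends of the ranges bracketing it at about 56% and 45%.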

-- MariaDimou - 02 Apr 2008

Topic attachments
Attachment                 History   Size       Date                 Who
GGUS-T0-20100323.pdf       r1        130.2 K    2010-03-24 - 17:51   MariaDimou
GGUS-T0-20100323.ppt       r1        326.5 K    2010-03-24 - 17:52   MariaDimou
GGUS-T0-20100323.pptx      r1        103.8 K    2010-03-24 - 17:47   MariaDimou
Topic revision: r65 - 2011-11-10 - MariaDimou
 