LHC VO User Support
Maria Dimou (CERN/IT/GD/OPS) has been the contact person for Grid User Support questions for the LHC experiment VOs since March 1st 2007.
Objectives
Provide and record, via the ESC meeting:
- Improvements to the information flow in tickets and to procedures and documentation.
- Ticket reviews with the 1st line supporters within the LHC VOs.
- Concrete proposals in the GGUS tickets aiming to reduce the ticket turn-around time.
- Increased awareness among users, supporters and sites of the Grid's main information sources, i.e. GGUS, GOCDB, gocwiki and the Operations' portal.
Practices
Ticket assignment analysis
As a Ticket Process Management (TPM) monitor, analyse chunks of GGUS tickets to ensure that:
- 95% of the tickets pass to status ASSIGNED in less than one hour (within reason).
- 95% of the tickets are assigned to the correct Support Unit (SU) the first time.
Remarks from this analysis are recorded in the tickets themselves and in a TPM report discussed at the ROC managers' meetings every two weeks.
Open ticket review per VO
Regularly check the latest list of tickets per LHC Experiment VO in Status 'open'. Discuss difficult cases with the relevant SU, EIS colleagues and 1st-line supporters in the VO. A session at the ESC meeting could be established for this.
User Supporters for non-LHC VOs who wish to construct such a list at any time should use this URL (substituting VOname with the actual VO name):
https://gus.fzk.de/ws/ticket_search.php?supportunit=all&vo=VOname&status=open&radiotf=1&timeframe=no
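As an illustration, here is a minimal Python sketch that builds this search URL for a given VO name. It only assembles the query string shown above; whether the resulting page can be fetched anonymously or requires certificate-based authentication is an assumption to check with the GGUS team.

from urllib.parse import urlencode

# Ticket-search endpoint taken from the URL above.
BASE = "https://gus.fzk.de/ws/ticket_search.php"

def open_ticket_url(vo_name):
    """Return the GGUS search URL listing all currently open tickets of one VO."""
    params = {
        "supportunit": "all",
        "vo": vo_name,        # e.g. "atlas", "alice", "cms", "lhcb"
        "status": "open",
        "radiotf": "1",
        "timeframe": "no",
    }
    return BASE + "?" + urlencode(params)

if __name__ == "__main__":
    for vo in ("atlas", "alice", "cms", "lhcb"):
        print(open_ticket_url(vo))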
Meetings
- With Grid Sites weekly on Monday at 16:30hrs at the Operations' meetings.
- With Grid Service Managers weekly on Wednesday at 10:00am at the LCGSCM (Service Coordination Meeting).
- With Experiment Supporters fortnightly on Monday at 14:30hrs at the ECM (Experiment Coordination Meeting).
- With ROC and GGUS developers monthly on Thursday at 11:00am at the ESC (Executive Support Committee).
- With experiments, ticketing-system experts and IT service managers on 2011/03/21: https://twiki.cern.ch/twiki/pub/LCG/WLCGVOCCoordination/VObox_application_support_by_WLCG_VOCs.txt
ALARM tickets' quality
The meeting took place on 2010/05/10. It was an action from the 2010/04/29 Tier1 Coordination meeting. Agenda: http://indico.cern.ch/conferenceDisplay.py?confId=93347
Purpose: To discuss with experts from GGUS, GOCDB, the Operations' portal, CERN PRMS, Tier0 and Tier1 service managers and LHC VO Authorised ALARMers, and to analyse issues encountered in production or during regular tests, such as the monthly tests after each GGUS release.
Present: Peter Kreuzer (CMS), Maite Barroso (T0 services), Maria Dimou, Maria Girone, Jamie Shiers (Experiment Services) (all CERN), Gareth Smith (RAL), Jon Bakker (FNAL), Cyril L'Orphelin (IN2P3), Joel Closier (LHCb), Gilles Matthieu (GOCDB), Xavier Mol (KIT), Guenter Grein, Helmut Dres (GGUS), Di Qing (Triumf), Josep Flix (PIC).
What must work at all times - decisions per agenda item:
- GGUS front-end and database, so that shifters are always able to submit or update a ticket. See savannah #113831 for the front-end and savannah #101122 for the database. The latter (the fail-safe database) has been in service since April 2010. GGUS, with German Grid funds, will make the fail-safe front-end available during the 2nd half of 2010. ACTION_1: Torsten should enter a detailed plan in ticket #113831 and ACTION_2: the wlcg-smod should regularly emphasise this requirement at the MB.
- GGUS extract of Authorised ALARMers from VOMS, so that Authorised ALARMers can always be recognised as such. In service since June 2009; see savannah #104835. If the nightly update returns an empty list, the synchronization script stops and the ALARMers of the previous day remain valid for GGUS (see the sketch after this list). No action.
- GGUS extract of site names and contact/emergency emails from GOCDB. Tier1 emergency emails were found empty in GOCDB twice, in October 2009 and April 2010. See savannah #15009. GGUS extracts the agreed GOCDB fields by script every night. Gilles decided to introduce a work-around directly at the database level to disallow a NULL value in the 'Emergency Email' field of Tier0/1 sites, because GOCDB4 will be out at the end of June 2010. ACTION_3: Gilles and Guenter to test GGUS-GOCDB4 to avoid surprises at release time. Maite asked whether LHCVOname-operator-alarm@cern.ch is still needed or if we can have one common address, as in GOCDB, for the whole Tier0. ACTION_4: MariaD will investigate current needs and discuss at the next Tier1 coordination meeting on 2010/05/20.
- Correct and timely site downtime notification from the Operations' Portal, so that no ALARMs are opened against sites in scheduled downtime. Cyril proposes not to use downtime notifications to discover the site status but rather the Operations' portal web interface for current downtimes. CMS and LHCb are happy with notifications and open to the web-interface evolution, but for now they report that they use the Google calendar (CMS example). Ideally they would like to see downtimes taken into consideration by the dashboards. This request will also be discussed at the next Tier1 coordination meeting. ACTION_5: MariaG will prepare this with Julia Andreeva.
- Clear operators' instructions for all relevant WLCG services at the Tier0 and all Tier1s. The list of critical services must be checked for updates. The review of instructions per service will be done in one of the coming Tier1 coordination meetings. ACTION_6: MariaG to include this in the agenda when appropriate.
- Smooth and quick assignment to the Tier0/Tier1 local ticketing system, e.g. GGUS-PRMS for the Tier0. See the Tier0 presentation at the Tier1 coordination meeting of March 25th 2010. Every ALARM ticket to the Tier0 is automatically assigned to CERN_ROC. This now means that CERN/IT/PES members dispatch the relevant (automatically created) Remedy PRMS ticket to the appropriate category. 3rd-level support is only involved when the incident reveals a middleware bug; in such cases a savannah bug is opened and linked from/to the relevant GGUS ticket. MariaG asked Maite to check https://gus.fzk.de/ws/ticket_info.php?ticket=57053 which needed to be re-classified and required a special 3rd-level intervention. Guenter said that CNAF, PIC, FZK and CERN have local ticketing systems, so delays in correct dispatching can happen there as well. ACTION_7: Should we review each of these sites at the Tier1 coordination meeting?
- Should we continue the periodic tests 3-4 times/year before a GDB, according to these testing rules? Yes! ACTION_8: Should we run the tests in time for the June 9th GDB? If yes, a reminder should be announced at the daily meeting during the last week of May and Authorised ALARMers should run the tests during the 1st week of June.
- A.O.B. by Peter:
- It is at the discretion of the VO to decide the number of Authorised ALARMers, by putting them in the right VOMS group. See the HowTo instructions.
- There is no way to convert a normal or TEAM ticket into an ALARM one. One has to re-open and reference the existing unsatisfied tickets for emphasis.
- A 'critical' service doesn't automatically determine the ticket priority level. The expert will decide whether opening an ALARM ticket is the safe way to obtain a solution in the time required. Abuses will show up in the drills we perform for every WLCG MB. Example from the 2010/05/11 MB, slides 3 to 10.
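To illustrate the guard described in the VOMS-extract item above (the sketch referenced there), here is a minimal Python example of keeping the previous day's Authorised ALARMers when the nightly extract comes back empty. The file name and the fetch function are hypothetical; this is not the actual GGUS synchronization script.

import json

ALARMERS_FILE = "authorised_alarmers.json"   # hypothetical local cache kept by GGUS

def nightly_sync(fetch_from_voms):
    """Refresh the ALARMer list, keeping yesterday's list if the extract is empty."""
    new_list = fetch_from_voms()              # e.g. a VOMS query returning DNs
    if not new_list:
        # Empty extract: stop here, so the previous day's ALARMers remain valid.
        print("VOMS extract empty - keeping the previous day's list")
        return
    with open(ALARMERS_FILE, "w") as cache:
        json.dump(sorted(new_list), cache, indent=2)

if __name__ == "__main__":
    # Example run with a stubbed VOMS query returning one DN.
    nightly_sync(lambda: ["/DC=ch/DC=cern/CN=Example Alarmer"])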
Per VO
- With 1st-line supporters within the Experiment in ad-hoc meetings.
Documentation review
As part of the it-dep-gd-ops-us@cern.ch team, improve the online documentation for supporters linked from https://gus.fzk.de/pages/info_for_supporters.php
Reviewed.
FAQs defining VOSupport(VOName) Support Units became obsolete with the introduction of VO-ID cards. Details in savannah bug #102661.
Document 1300: reviewed. Updated version in savannah bug #102727.
Proposal for VO documentation changes
The following concerns mostly the GGUS Documentation section for VO Users: https://gus.fzk.de/pages/docu.php#2
| Row number | Current text | Proposed change | Who will do what |
| 1 | VO Tools description (2005-09-16) | Verify that this link is all that exists about VO Tools. Remove the date and link the recent SAM/FCR Critical tests' FAQs | Maria/Judit/VO contacts |
| 2 | VO BOX description | This LCG User Guide chapter looks outdated | Maria/Patricia/other ALICE experts to give a new pointer |
| 3 | VOBOX How TO | Verify that this link is the best and of interest to all VOs | Maria/Patricia/VO Contacts |
| 4 | LCG Experiment Integration Support | This group and its website no longer exist. This entire row should be changed to LHC Experiment VO Support and point to this page for the moment | GGUS developers |
| 5 | LCG EIS Documentation and tools | This group and its website no longer exist. This entire row should be removed | GGUS developers |
| 6 | Experiment Computing Sites, including Documentation | Contains only 8 VOs and is very outdated. All individual VO names should be removed and only the VO-ID cards' list should appear. VO web pages appear there. | GGUS developers |
| 7 | A list of VOs supported by EIS can be found... | This group and its website no longer exist. This entire row should be removed | GGUS developers |
| 8 | All links anywhere on the page https://gus.fzk.de/pages/docu.php pointing to the LCG-2 User Guide (2 occurrences) | Obsolete. Should be replaced by the gLite User Guide | GGUS developers |
| 9 | The link (3rd row from the page top) pointing to the LCG-2 User Scenario | Obsolete. Should be removed | GGUS developers |
| 10 | VOMS Administration and User Guide (Date: 2005-03-17) | Replace the link by this for VOMS core and by http://edms.cern.ch/document/572406 for VOMS Admin until further notice | Maria will check with the VOMS developers but GGUS developers should make the change now |
| 11 | Instructions for installing and configuring an LCG-2 site | Obsolete link. Replace by the gLite entry point | GGUS developers |
| 12 | Instructions for installing and configuring VOMS | Obsolete link. Replace by http://edms.cern.ch/document/818502 | GGUS developers |
| 13 | Information System Trouble Shouting Guide | Wrong spelling (should be 'Shooting') and invalid link. Point to https://twiki.cern.ch/twiki/bin/view/EGEE/InfoTrouble | GGUS developers |
GGUS enhancement requests
Create and follow up savannah tickets with new functionality requested by the 1st-line LHC VO supporters. Follow the acceptance and implementation status via GGUS plans and releases.
Newly registered VO wants to provide support via GGUS
You have recently completed your VO-ID card via http://operations-portal.egi.eu/vo. The GGUS development team was notified that you wish to provide support to the users of your VO via http://ggus.org. In order to do that, your VO experts will have to become part of a Support Unit (SU). In practice this means:
How to set up the mailing list (e-group) for ticket notifications: this mailing list will either be set up at CERN or provided by the SU requester at their own site.
If the e-group is set up at CERN, it will be labelled yourVOname-grid-support@cern.ch. The owner of the e-group will be the SU requester. If the SU requester (who should preferably be the VO Admin) has no CERN login account, the e-group should NOT be created at CERN, as external e-group owners are not allowed (see the response in https://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000661882&email=maria.dimou@cern.ch).
The e-group yourVOname-grid-support-admin@cern.ch should also be created to contain the administrators of the SU e-group. These will be the SU requester, one or more backup people and the GGUS team, via the inclusion of the existing e-group 'ggus-lists-admins'.
If the requester provides a mailing list for receiving ticket-notification emails, they have to make sure that GGUS is allowed to post to this list.
GGUS uses two different mail addresses:
- helpdesk@ggus.org is used for ticket-related emails,
- support@ggus.org is used for contacting the GGUS developers.
It is your responsibility to keep your VO-ID card on the Operations' portal up to date with valid information about the supporters of your VO.
The GGUS development team will open, on your behalf, a request in the savannah project at https://savannah.cern.ch/support/?func=additem&group=esc to plan the effective date of your VO's integration as a production GGUS SU. You will receive an automatic notification with the savannah ticket corresponding to your case. If you have questions about this process, please contact ggus-info@cern.ch. All this information is also included in https://wiki.egi.eu/wiki/FAQ_GGUS-New-Support-Unit. Thank you!
ATLAS-IT meeting on ALARMs from SNOW 20121001
This was the start of multiple meetings and presentations on this issue. The idea to raise ALARMs from SNOW was abandoned in favour of the well-tested GGUS entry point. The matter is on-going; progress is recorded in Savannah:132582.
SIRDNSfailure20100512
Description
Partial failure of the Name Service for .de domains on 20100512 between 13:30 and up to 17:45 (CEST) for some sites, due to caching by the ISPs, even though the problem was solved by 15:45 according to DENIC, the company hosting the .de Top Level Domain. Here is a DENIC press release.
Impact
This incident prevented access to GGUS, as the http://ggus.org domain is today a re-direct to https://gus.fzk.de. This means, as in any incident causing GGUS unavailability, that ticket submission/update and site notification were impossible. This is particularly critical in case of ALARMs. It also presumably prevented data transfer to the GridKa Tier1 (did it?).
Analysis
Currently http://www.ggus.org, the 'promoted' URL, is just a re-direct to https://gus.fzk.de. As the institute has been renamed to kit.edu, GGUS plans to fully move to http://www.ggus.eu soon (no redirect anymore). It is suggested that, as a fallback, http://ggus.org also be registered and remain in operation at all times. This would produce a certificate warning in the browser, but it would still work. This solution would bypass the problem of DNS propagation delay (up to one day) when the same domain name is re-mapped to different IP addresses. If both top-level domains fail at the same time, using the IP address would be the solution. Emergency broadcast procedures should be rehearsed to propagate such information to all interested parties. A detailed plan on how to implement, configure, test and rehearse such a fall-back must be documented by the GGUS developers in https://savannah.cern.ch/support/?113831. Similar solutions must be put in place for the Operations' portal, GOCDB and OSG OIM.
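As a rough illustration of the suggested fallback, here is a minimal Python sketch that probes a list of alternative GGUS entry points in order and reports the first one that answers. The endpoint list, the plain-HTTPS probe and the relaxed certificate check are assumptions for illustration, not the documented GGUS procedure.

import ssl
import urllib.request

# Hypothetical ordered list of entry points: promoted URL first, fallback domain next.
ENDPOINTS = ["https://www.ggus.eu", "https://www.ggus.org"]

def first_reachable(endpoints, timeout=10):
    """Return the first endpoint that answers an HTTPS request, or None."""
    # A certificate-name mismatch is expected on the fallback domain
    # (the browser warning mentioned above), so verification is relaxed here.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    for url in endpoints:
        try:
            urllib.request.urlopen(url, timeout=timeout, context=ctx)
            return url
        except OSError:
            continue  # DNS failure, timeout or connection error: try the next one
    return None

if __name__ == "__main__":
    print(first_reachable(ENDPOINTS) or "no GGUS entry point reachable")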
Timeline
| WHEN CEST | WHAT |
| 2010/05/12 13:30 | Users sometimes receive the incorrect reply "domain does not exist" for domains under .de. |
| 2010/05/12 14:20 | DENIC technicians diagnose an incomplete update of the name-service data (zone file) at 12 of the 16 service locations and switch off those locations to avoid faulty responses. |
| 2010/05/12 14:30 | The switched-off locations are successively provided with a complete zone file and re-integrated into the name-server network. |
| 2010/05/12 15:45 | Worldwide DNS data re-distribution of these locations completed by DENIC. |
| 2010/05/12 17:45 | End of all possible disturbances perceived by Internet users due to caching by the ISPs. This end-of-incident timestamp is approximate. |
SIRGGUSfailure20101116
Description
Total unavailability of the GGUS web interface between ~08:15 and ~11:00 CET. This SIR will be presented at the 20101202 T1SCM.
Impact
This incident prevented access to GGUS: every page behind http://ggus.org was hanging. This means, as in any incident causing GGUS unavailability, that ticket submission/update and site notification were impossible. This is particularly critical in case an ALARM ticket needs to be submitted during the incident.
Analysis
A problem with the mail parser had already been observed the day before: GGUS developer Guenter Grein notified the TPM e-group on 2010/11/15 at 16:53. It is unclear whether this is related to the 20101116 incident (web pages hanging). A possible cause is the simultaneous submission of multiple tickets to the Spanish ticketing system, done by the TPM on 2010/11/16 around 08:30, which may have caused the SOAP component to freeze and block all web services.
GGUS developers tried to reproduce the problem on the training system by doing multiple updates to the Spanish test system, without provoking the problem.
Neither the logs on the GGUS side nor those on the Spanish side contain any hint about the outage. Hence the outage is not understood so far.
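For illustration only, the reproduction attempt described above (firing several updates at the remote test interface at the same time) could look like the following minimal Python sketch. The endpoint is a placeholder, not the real Spanish test system, and the actual GGUS reproduction was done on their training system.

import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Placeholder endpoint standing in for the remote test interface.
TEST_ENDPOINT = "https://example.invalid/soap-test"

def submit_update(i, timeout=30):
    """Fire one dummy update request and report its outcome."""
    try:
        with urllib.request.urlopen(TEST_ENDPOINT, timeout=timeout) as resp:
            return "update %d: HTTP %d" % (i, resp.status)
    except OSError as err:
        return "update %d: failed (%s)" % (i, err)

if __name__ == "__main__":
    # Several simultaneous submissions, mimicking the suspected trigger
    # that froze the SOAP component.
    with ThreadPoolExecutor(max_workers=8) as pool:
        for result in pool.map(submit_update, range(8)):
            print(result)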
Timeline
| WHEN CET | WHAT |
| 2010/11/16 08:46 | Multiple reports by ATLAS shifters (Jaroslava) and supporters (Alessandra F.) that they can't open GGUS web pages. Developers, already aware of the problem, are investigating. |
| 2010/11/16 09:20 | As the problem persists, Guenter emails the ggus-info e-group. |
| 2010/11/16 09:33 | MariaD forwards the information to the relevant WLCG e-groups. |
| 2010/11/16 09:54 | MariaA forwards the information to the middleware developers' e-group. |
| 2010/11/16 10:10 | Guenter announces the web interface is back; the mail parser is still down. |
| 2010/11/16 11:59-12:22 | The IN2P3 ticketing-system interface developer still reports an error when requesting the GGUS web service: "Server returned HTTP response code: 503 for URL: https://gusiwr.fzk.de/arsys/services/ARService?server=gusiwr&webService=CIC_HelpDesk". The problem disappeared 20 minutes later. |
SIRGGUSfailure20101126
Description
GGUS database unavailability between ~13:00 and ~14:30 CET. This SIR will be presented at the 20101202 T1SCM.
Impact
This incident prevented access to GGUS (home page, ticket viewing and submission). Like any GGUS unavailability, this is particularly critical in case of ALARMs.
Analysis
As soon as ATLAS (Jaroslava) reported that the GGUS pages were inaccessible, Guenter responded within 5 minutes and started investigating. The problem was quickly identified as database-related. 1.5 hours later the following diagnosis was published via email and at the WLCG daily meeting on the same day: the reason for the outage was an unscheduled reboot of the Oracle DB cluster failing on all nodes. The DB reboot was caused by a combination of hardware failure and misconfiguration: the hardware failure made the DB cluster lose some essential control files, and the misconfiguration prevented the cluster from rebooting.
Timeline
| WHEN CET | WHAT |
| 2010/11/26 13:11 | Jaroslava reports that one can't open GGUS web pages. |
| 2010/11/26 13:17 | Guenter starts investigating. |
| 2010/11/26 13:22 | Guenter informs the tpm and ggus-info e-groups that GGUS is down. First estimate: database problem. |
| 2010/11/26 13:26 | David Crooks from the Glasgow Grid site submits the error message he gets: "No DB connect ... please try later or contact webmaster ...". |
| 2010/11/26 14:37 | Guenter announces the GGUS service is available again. |
Submitted by Helmut, now attached to https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110727GGUS_Service_Incident_Report.pdf
SIRDNSfailure20111006
Description
Partial failure of the name service at KIT, from 20111006 11:15 up to 20111007 12:00 (CEST), for some SOAP interfaces, due to an update of the KIT IPS (intrusion prevention system).
Impact
This incident broke some SOAP-based interfaces of GGUS, e.g. CERN, NGI_IT, NGI_GRNET, i.e. for ~24 hours GGUS tickets couldn't create/update their peers in those other local ticketing systems to which GGUS interfaces using web services.
Analysis
The administrators of the IPS didn't announce the update, as they didn't expect it to impact any system. Hence it took the GGUS developers some hours to find out the reason for the SOAP problems.
SIRSNOWinterfacefailure20111104
Description
Failure of the wrapper tool used for interfacing GGUS with SNOW, from 20111104 16:15 up to 20111107 11:00 (CET).
Impact
This incident broke the interfaces to SNOW, NGI_France and ROC_Russia. GGUS tickets couldn't create/update their peers in the other ticketing systems.
Analysis
There was an issue with the outgoing mail server at KIT on Thursday (2011-11-04) and Friday (2011-11-05).
Not being aware of the mail-server issue, the GGUS team tried to fix the mail problems by restarting the Remedy system. Unfortunately the Remedy server didn't load the correct environment during the reboot, because the root account was not used for the reboot. This led to a failure of all scripts used in GGUS and triggered by the Remedy server, such as the SNOW wrapper and some others.
The startup script of the Remedy server has now been changed to force the use of the environment settings of the root account when rebooting the server.
SIRGGUSunreachable20120320
Description
The DNS update was not propagated to all DNS servers in a timely manner, from 20120320 13:00 up to 20120321 11:00 (CET).
Impact
GGUS was not reachable for some regions depending on the frequency of DNS cache updates.
Analysis
With the release of March 20th 2012, GGUS changed the server infrastructure for the web front-end from running only one server to running two servers in parallel. The IP address of the previous standalone server was disabled while the new parallel servers' IPs were enabled during the release.
The network team at KIT wasn't aware of the appropriate way to set the TTL value for the external DNS service. Therefore some regions could access GGUS as usual, while others, like CERN, could not.
For those who couldn't access GGUS after the release, a DNS update forced by their local network administrators solved the problem.
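For reference, a quick way to see which address and remaining TTL the local resolver currently returns for the GGUS front-end; this minimal sketch assumes the third-party dnspython package is installed and is not part of any documented GGUS procedure.

import dns.resolver  # third-party "dnspython" package (assumed available)

def show_cached_view(hostname="www.ggus.eu"):
    """Print the A records and remaining TTL the local resolver returns for a host."""
    answer = dns.resolver.resolve(hostname, "A")
    for record in answer:
        print(hostname, "->", record.address, "(TTL:", answer.rrset.ttl, "s)")

if __name__ == "__main__":
    show_cached_view()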
SIRGGUSunreachable20120522
Again a DNS-update-related problem. In case the WLCG SCOD doesn't require a formal SIR, here is the information about what happened, for the record: https://savannah.cern.ch/support/?113831#comment40
SIRGGUSunreachable20120701
GGUS was one of the services troubled by the leap second of June 30th. Reported in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek120701#Tuesday (within the ATLAS report): GGUS issue Sunday to Monday due to the leap second. Full explanation by GGUS developer Guenter Grein: "The problem started after introducing a leap second in the night 20120630 to 20120701. It ended Monday morning 09:30 after restarting not only the Remedy server but also the underlying OS. Especially the Java based Remedy mid-tier was facing problems which increased with the rising number of accesses to the GGUS web portal this morning. Several publications in the www report problems of Linux kernels and Java applications related to the leap second. We believe this has caused the GGUS problems too. The GGUS server is running on a Linux OS and a major part of Remedy is Java-based."
Aggregation of monitoring information
Notes from the 2010/09/28 CERN/IT/ES meeting: http://indico.cern.ch/event/Monitoring20100928. In the table below 'Y' = Yes, the experiment has it. Comments by Julia, Federico, Nicolo', Patricia are included. Last update: 20101013.
| Tool\VO | Alice | ATLAS | CMS | LHCb | Is the tool private to the VO? | Does it have a programmatic interface? | Comments |
| Lemon->SLS (GridMap) | Y | Y | Y | Y | N | Y | |
| SAM->Nagios | Y | Y | Y | Y | N | Y | Nagios only works for the CREAM CE with a special submission mode not used by most VOs |
| MonALISA | Y | N | N | N | Full suite used by Alice only, but the framework is used by other tools too | Y | Discrepancy with GridView! |
| GridView | Y | Y | Y | Y | Y | Y | Discrepancy with MonALISA |
| Database tools | N | Y | N | N | ? | ? | |
| DDM (Data Distribution Monitor) | N | Y | N | N | Y, made by IT/ES/DNG for ATLAS | Y | Many DDM tools for data replication and more |
| SSB (Site Status Board) | Y | Y | Y | Y | N, made by IT/ES/DNG | Y | Pending discussion on the best use of SSB by Alice |
| Dashboard job monitoring | N | Starting | Y | N | N | Y | Provides job-monitoring data for all 4 VOs; not used by Alice and LHCb now |
| Dashboard SAM usability monitor | Y | Y | Y | Y | N | Y | Will be replaced when SAM becomes Nagios-only |
| Downtime Calendar | N, but needs one and wishes to discuss its required features | Y (google) | Y (google) | A calendar tool based on downtime notifications and google, now being replaced by a GOCDB local cache of LHCb-supporting sites only, with their DIRAC names | Implementation differences are normal due to topology | (GOCDB and OIM imports) | Wish for a common info collector across VOs |
| Panda Analysis Monitor | N | Y | N | N | Y (ATLAS) | Y | N |
| Hammercloud (site stress testing) | N | Y | Y | Y | N (tool made by IT/ES/VOS) | Y | Talk to the developers for use by other tools |
| PhEDEx (transfer monitoring) | N | N | Y | N | Y (developed by CMS with support from IT-ES-VOS) | Y | N |
| RSS (Resource Status System) | N | N | N | Y | Y, LHCb | Y | N, a DIRAC service |
| VO comments/wishes | Solve the MonALISA/GridView issue, identify more common tools | Need an FTS monitor and a network monitor; wishes a common downtime info collector across experiments | Same as ATLAS + storage accounting | Happy with RSS but also needs FTS and network-monitoring tools | N/A | N/A | N/A |
EGI SA3 Metrics
Metric ID: M.SA3.16 / Task: TSA3.3 Project Year 1
Definitions as per Savannah:118984#comment0: total number of tickets provided for the 4 LHC VOs.
Metric ID: M.SA3.16 / Task: TSA3.3 Project Year 2
Definitions as per Savannah:118984#comment5: only true ALARM ticket totals are now required for the 4 LHC VOs, as a result of the reviewers' report for the 1st year of the EGI project.
Grid Information Sources
As 1st-line VO supporters, please consult these links which might contain an announcement that explains a problem your users are facing:
-- Main.dimou - 23 Apr 2007