WLCG GGUS Operations
Every day
Scan through the GGUS notifications in your inbox. They may concern GGUS tickets for:
- ROC_CERN, i.e. Grid Services. Here it is important that the GGUS-SNow interface works well and that the right supporters follow up the tickets with a response time appropriate to the ticket priority and type. ALARM tickets should be in the hands of the experts in less than 1 hour.
- The GGUS Support Unit (SU), i.e. incidents and requests related to the GGUS infrastructure itself.
- Any other SU to which you belong. There you typically would act as a supporter.
- Any GGUS ticket that was brought up at the WLCG Operations (Coordination) meetings as having a problem in routing or response.
- Investigate user complaints for any email you receive as a member of the ggus-escalation-notifications e-group, with Subject: REMINDER Escalation Level X.
In case of GGUS downtime
The e-group ggus-downtimes contains four sub-e-groups, named ggus-downtimes-VOname (VOname = alice | atlas | cms | lhcb). When the GGUS developers publish a downtime (scheduled or not) in GOCDB, they should also email the top-level e-group. The sub-e-group members within the experiments decide whom to inform in their community.
Every Monday
Update the GGUS section in the appropriate page under WLCGOperationsMeetings with announcements of upcoming releases or debug info for relevant issues, if any. Participate in this meeting. If you can't be there, please read the notes from the meeting, in case there are GGUS tickets wrongly assigned or not properly followed up. There might also be new development requests, problems with the SNow or OSG interfaces, misunderstandings concerning the workflows, TEAM or ALARM ticket creators in the experiments who lost their privileges, etc.
Before the WLCG MB (on Monday)
Prepare the graph of tickets
Update the file ggus-tickets.xlsx. Download the latest version from the WLCGOperationsMeetings page, where it should be permanently attached. This file contains weekly summaries so that the corresponding graph shows the evolution of GGUS ticket traffic over regular intervals. You need to cover the period from the Monday before the previous MB up to the Monday preceding the current MB, one week at a time. Open the ggus-tickets.xlsx file and:
- move the mouse to the bottom right corner of the table where you will see a small mark;
- click on that mark and drag the mouse downward to extend the table;
- please add exactly the number of rows that are needed to cover the weeks for your report.
The dates and the totals per experiment will be filled in automatically. When needed, please move the graph area downward to make space for the additional rows.
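To work out how many rows to add, the weeks in the reporting period can be listed with a short script. This is only a sketch: the function names are hypothetical, and it assumes that "the Monday before the MB" means the Monday of the week in which the MB takes place, and that the week of the current MB is not yet included.

```python
from datetime import date, timedelta


def monday_of(d: date) -> date:
    """Return the Monday of the week containing d."""
    return d - timedelta(days=d.weekday())


def weeks_to_report(previous_mb: date, current_mb: date) -> list[date]:
    """List the Monday that starts each week to be reported.

    Assumption: the period runs from the Monday of the week of the
    previous MB up to (but not including) the Monday of the week of
    the current MB.
    """
    start = monday_of(previous_mb)
    end = monday_of(current_mb)
    weeks = []
    while start < end:
        weeks.append(start)
        start += timedelta(days=7)
    return weeks


if __name__ == "__main__":
    # Example dates only: two MBs held on Tuesdays, two weeks apart.
    for monday in weeks_to_report(date(2024, 3, 5), date(2024, 3, 19)):
        print(monday.isoformat())
```

The number of dates printed is the number of rows to add to the table.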
The GGUS Report Generator is used to obtain the numbers for each additional row, per experiment. Instructions:
- Open the GGUS Report Generator (full documentation here).
- Select period from Monday-week-previous-MB to Monday-week-current-MB.
- Select the 4 LHC VOs and click on Group by.
- Select ALL ticket types and click on Group by.
- Select ticket categories Incident and Service Request.
- Select weekly aggregation.
- Click GO!
- Write the totals of each week in your local copy of file ggus-tickets.xlsx.
- To help avoid mistakes, compare the value in each column with the ones directly above: big changes in any of the columns ought to be rare.
- Pay special attention to alarm tickets, see below.
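The comparison with the row above can also be done mechanically. The sketch below flags columns whose value differs from the previous week by more than a chosen factor. The column names, the factor of 3 and the minimum count of 5 are illustrative assumptions, not values defined by this procedure; a flagged column is only a prompt to re-check the number in the Report Generator.

```python
def flag_large_changes(previous_row, new_row, max_ratio=3.0, min_count=5):
    """Return the columns whose value changed by more than max_ratio.

    previous_row and new_row map a column name to a ticket count.
    Columns where both values are below min_count are skipped, because
    small numbers fluctuate a lot from week to week.
    """
    flagged = []
    for column, new_value in new_row.items():
        old_value = previous_row.get(column, 0)
        low, high = sorted((old_value, new_value))
        if high < min_count:
            continue
        if low == 0 or high / low > max_ratio:
            flagged.append(column)
    return flagged


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    last_week = {"ALICE": 4, "ATLAS": 30, "CMS": 25, "LHCb": 6}
    this_week = {"ALICE": 5, "ATLAS": 95, "CMS": 27, "LHCb": 0}
    print(flag_large_changes(last_week, this_week))
```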
When all weeks have been done, update the ggus-tickets.xlsx attachment on the WLCGOperationsMeetings page.
Note that you will need the graph for the MB Service Report, as documented below.
Alarm tickets need special attention:
- Test alarms (e.g. accompanying GGUS releases) are not always marked Test.
- Use the GGUS search engine to list all alarm ticket candidates:
- For Special attributes select ALARM-Tickets.
- For Status ensure all is selected.
- For Ticket category select Incident.
- Select the appropriate time period and click GO!
- Check the subjects of the list of tickets shown: test cases should be obvious and must be included neither in ggus-tickets.xlsx nor in the MB report.
- Each real alarm ticket should be briefly described in the MB report, usually in the operations section for the affected experiment and/or site.
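If the list of alarm candidates is long, a first pass over the subjects can be automated. This sketch assumes the subjects have been copied from the search results by hand into a list, and it uses a simple keyword heuristic (any word starting with "test") that is an assumption, not a GGUS feature. It does not replace reading the subjects: a test alarm with an unusual subject will end up in the review list.

```python
import re

# Heuristic: a word beginning with "test" ("Test", "testing", ...).
# The word boundary avoids matching words such as "latest".
TEST_PATTERN = re.compile(r"\btest", re.IGNORECASE)


def split_alarm_candidates(subjects):
    """Split ticket subjects into (likely test alarms, tickets to review)."""
    likely_test = [s for s in subjects if TEST_PATTERN.search(s)]
    to_review = [s for s in subjects if not TEST_PATTERN.search(s)]
    return likely_test, to_review


if __name__ == "__main__":
    # Made-up subjects, for illustration only.
    subjects = [
        "Test ALARM after GGUS release",
        "Transfers failing with the latest configuration",
        "ALARM testing for new workflow",
    ]
    likely_test, to_review = split_alarm_candidates(subjects)
    print("Likely test alarms:", likely_test)
    print("To review by hand:", to_review)
```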
Then, when all the relevant weeks have been done individually, to simplify filling out the table on the GGUS slide (see below):
- Select period from Monday-week-previous-MB to Monday-week-current-MB.
- Select yearly aggregation.
- Click GO!
- Write the totals per ticket type for each experiment in the table on the GGUS slide, see below.
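As a cross-check on the slide table, the weekly rows entered earlier can be summed and compared with the aggregated totals. The sketch below assumes that each weekly row is available as a mapping from a column label to a count, and that the aggregated query covers exactly the same weeks. The labels are hypothetical. A period that crosses a year boundary may be reported differently by the yearly aggregation, so a mismatch there is not necessarily an error.

```python
def sum_weekly_rows(weekly_rows):
    """Add up the counts of several weekly rows, column by column."""
    totals = {}
    for row in weekly_rows:
        for column, count in row.items():
            totals[column] = totals.get(column, 0) + count
    return totals


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    weekly_rows = [
        {"ATLAS TEAM": 10, "CMS TEAM": 7},
        {"ATLAS TEAM": 12, "CMS TEAM": 5},
    ]
    print(sum_weekly_rows(weekly_rows))
```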
Prepare the slide for the MB
- Use the template attached to this page to make the slide for the service report.
- Include the graph from the latest ggus-tickets.xlsx attached to WLCGOperationsMeetings. Beware: the legend info must be complete (all 4 experiments) and readable. NOTE: this may need to be done on a Windows host (e.g. the Windows Terminal Service cernts.cern.ch, available e.g. through the MS Remote Desktop client), because on macOS the graph legend date format may not work! A workaround would be to import the graph as an image instead of an Excel workbook.
Around GGUS release dates
- On the Monday two days before the release, at 3pm: announce the upcoming release in the WLCG Operations meeting minutes. Emphasize any important upcoming changes listed in the release notes.
- Assist the GGUS team with the follow-up of problematic test alarm tickets, if needed.
About the GGUS-SNow interface
Although the mappings were agreed in January 2011, the interface has suffered from unilateral SNow changes for which GGUS was given no advance notification.
Documentation:
About the GGUS Architecture
Historic documentation:
Before the end-of-year break
Publish this text in the weekly operations meeting:
For the end-of-year break: GGUS is monitored by a system connected to the on-call service. In case of total GGUS unavailability the on-call engineer (OCE) at KIT will be informed and will take appropriate action. If GGUS is available but there is a problem with the workflow (e.g. ALARM to CERN doesn't generate email notification to the operators), then WLCG should submit an ALARM ticket, notifying site FZK-LCG2 (DE-KIT), which triggers a phone call to the OCE. As a last resort, the FZK-LCG2 emergency e-mail or telephone number published in the GOCDB can be contacted.