SAM Analysis Reporting
Introduction
The members of IT-SDC-MI participate in WLCG Operations by investigating SAM test failures, attending WLCG Daily Operations Meetings, and providing a weekly Service Availability Analysis report to WLCG Service Coordinators. This document describes how to perform that task.
WLCG Daily Meeting
These are held at 15.00 CE(S)T Monday and Thursday in CERN 513 R-068. IT-SDC-MI representatives attend following the weekly support rota. The minutes are here: WLCG Daily Operations Meetings Minutes.
Daily - Investigating SAM Test Failures
And the 48 hour view can be useful too:
- As of 2nd April 2012, ATLAS, CMS and LHCb are pointing to the test instance of SUM, while ALICE is pointing to the production instance of SUM.
- Investigate any non-green cells by drilling down to the failures.
- Note which tests are failing and where.
- Note any corresponding downtime from GOCDB. (For US sites see also OIM.)
- Note any related GGUS ticket from GGUS. (For CMS see also CMS CIS Savannah.)
- Hints:
- You may find details about issues, downtimes, tickets on the WLCG Operations Meeting minutes twiki: WLCGOperationsMeetings.
- If you can't find any downtime or GGUS ticket to explain an issue then ask the VO contact in our group who may answer your question or pass it on:
- ALICE: CE tests are only relevant to ALICE at CERN (elsewhere they use CREAMCE) so if any cell is non-green just because of failing CE tests then you should add a green box for it.
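The note-taking steps above can be kept in a small structure so that unexplained failures stand out before the meeting. The following is only a sketch: the class name, field names, and the site and test names in the usage example are hypothetical and are not part of SAM, GOCDB, or GGUS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureNote:
    """One non-green cell found while drilling down (hypothetical structure)."""
    vo: str                          # e.g. "ATLAS", "ALICE", "CMS", "LHCb"
    site: str                        # site where the test is failing
    test: str                        # name of the failing test
    downtime: Optional[str] = None   # GOCDB (or OIM) downtime reference, if any
    ticket: Optional[str] = None     # GGUS (or CMS Savannah) ticket reference, if any
    monitoring_only: bool = False    # True for purely monitoring issues

    def is_explained(self) -> bool:
        # A failure counts as explained if a downtime or ticket covers it,
        # or if it was judged to be a monitoring-only problem.
        return bool(self.downtime or self.ticket or self.monitoring_only)


def unexplained(notes: List[FailureNote]) -> List[FailureNote]:
    """Return the notes that still need a question to the VO contact."""
    return [n for n in notes if not n.is_explained()]


if __name__ == "__main__":
    notes = [
        FailureNote("ATLAS", "SITE-A", "example-test", downtime="scheduled downtime"),
        FailureNote("CMS", "SITE-B", "example-test"),
    ]
    for note in unexplained(notes):
        print(f"{note.vo} {note.site}: {note.test} has no downtime or ticket")
```

Anything returned by `unexplained` is a candidate to raise with the VO contact or at the operations meeting.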
GOCDB entries
Daily - Attend WLCG Operations Meetings
- The VOs report issues of the previous day.
- If you have found an issue that is not reported or covered by a downtime then you should mention it.
- The Sites report issues of the previous day.
- Listen out for any downtime or issue announcements that might affect SAM tests in the coming days.
- The Services report issues of the previous day.
- Listen out for any downtime or issue announcements that might affect SAM tests in the coming days.
- Make any downtime or issues announcements on behalf of the Dashboard team.
Monday - Write The Reliability Report
- As of mid-March 2012 we compile Reliability reports (as opposed to the Availability reports compiled before).
- Extract the weekly quality plots for each VO.
- ALLVO-SUM
- File -> Save Page As... (Web Page, complete).
- Copy png files from the AllVOs_files/historicalsiteavailability_data* folders
- Create report
- Copy previous report, e.g. report-110117.ppt (available from previous SAM Analysis Reporter).
- Insert new images into report (ATLAS, ALICE, CMS, LHCb)
- Add numbers to any cell which is not green (0.x All VOs, 1.x ATLAS, 2.x ALICE, 3.x CMS, 4.x LHCb)
- It's a bit fuzzy when a cell is pale-green or yellow so just compare it to previous reports.
- For purely monitoring issues, use a numbered green box to cover the affected cell.
- Add daily report notes to page 2, referencing the numbers.
- Send report before 14:00 CET.
To: wlcg-scod@cern.ch, Julia Andreeva
Subject: Reliability Report for the Week of 110117
Dear all,
Please find attached the reliability report for the week of 110117.
Best regards,
John Smith
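The mechanical parts of the Monday procedure (gathering the png files, numbering the cells, and building the subject line) could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the week label is assumed to be the YYMMDD date of the Monday that starts the reported week (as the 110117 example suggests), and the copied files are prefixed with their folder name only to avoid name clashes.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path

# Cell-number prefixes from the numbering scheme in the list above.
VO_PREFIX = {"All VOs": 0, "ATLAS": 1, "ALICE": 2, "CMS": 3, "LHCb": 4}


def collect_plots(saved_page_dir, dest_dir):
    """Copy png files out of the historicalsiteavailability_data* folders."""
    src, dest = Path(saved_page_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for folder in sorted(src.glob("historicalsiteavailability_data*")):
        if not folder.is_dir():
            continue
        for png in sorted(folder.glob("*.png")):
            # Prefix with the folder name so equally named plots do not collide.
            target = dest / f"{folder.name}_{png.name}"
            shutil.copy2(png, target)
            copied.append(target)
    return copied


def reported_week_monday(today):
    """Monday of the week before `today` (assumption: the report covers that week)."""
    return today - timedelta(days=today.weekday() + 7)


def week_label(monday):
    """Format a date as YYMMDD, e.g. 17 Jan 2011 gives "110117"."""
    return monday.strftime("%y%m%d")


def cell_label(vo, index):
    """Label for the index-th non-green cell of a VO, e.g. "3.2" for CMS."""
    return f"{VO_PREFIX[vo]}.{index}"


def report_subject(monday):
    return f"Reliability Report for the Week of {week_label(monday)}"


if __name__ == "__main__":
    monday = reported_week_monday(date.today())
    print(report_subject(monday))
```

For a report written on Monday 24 January 2011, `reported_week_monday` gives 17 January 2011 and the subject matches the example above. Inserting the images into the report and attaching it to the mail remain manual steps.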
SAM Visualization Gotchas
- On the ALLVO-SAM page, the clickable areas do not line up with the cells, and you should open links in a new tab (since the overview page uses iframes).
- Sometimes when you access a page for a given VO it reports for a different VO. (Just refresh the page until it reports for the desired VO).
- The form on the “testhistory” view will not remember which instance you selected. (You have to reselect it if you want to resubmit the form).
- When you drill down to the “testhistory” view from the “historicalserviceavailability” view, you will see all tests, not just critical tests. To see only critical tests, resubmit the form by clicking “Show Results”. (Make sure you have reselected the instance, see note above).
- The above trick does not work on CMS SAM visualization. In this case, you have to click on "Latest Results" and select "VO Critical" under "Service Types" to view the list of critical tests in the "Test Types" selection box.
- If the monitoring in the SAM visualization looks a bit suspicious then check the original (non-Dashboard) SAM interface here: LCG-SAM
Tips for CMS tests
- Andrea's suggestion: tests that start with CE-org.cms... are not relevant, since they come from Nagios.
Tips for ATLAS tests
- To crosscheck site availability against dashboard monitoring, the Grid View tool can be used (example here from a Taiwan-LCG2 bad day in dashboard monitoring).
-- DavidTuckett - 24-Jan-2011