SAM Analysis Reporting

Introduction

The members of IT-SDC-MI participate in WLCG Operations by investigating SAM test failures, attending WLCG Daily Operations Meetings, and providing a weekly Service Availability Analysis report to WLCG Service Coordinators. This document describes how to perform that task.

WLCG Daily Meeting

The meetings are held at 15:00 CE(S)T on Mondays and Thursdays in CERN 513 R-068. IT-SDC-MI representatives attend according to the weekly support rota. The minutes are published on the WLCG Daily Operations Meetings Minutes page.

Daily - Investigating SAM Test Failures

The 48-hour view can also be useful:

  • NEW as of 2nd April 2012: ATLAS, CMS and LHCb point to the test instance of SUM; ALICE points to the production instance of SUM.
| VO    | Reliability 48 Hour view (SUM) | Reliability Weekly SUM | Availability 48 Hour view (SUM) | Availability Weekly SUM |
| ATLAS | Plot                           | Plot                   | Plot                            | Plot                    |
| ALICE | Plot                           | Plot                   | Plot                            | Plot                    |
| CMS   | Plot                           | Plot                   | Plot                            | Plot                    |
| LHCb  | Plot                           | Plot                   | Plot                            | Plot                    |
  • Investigate any non-green cells by drilling down to the failures.
  • Note which tests are failing and where.
    • Example:
      LHCb
      ---- 
      4.x IN2P3: CE-sft-vo-swdir test failing (Timeout after 150 seconds). Known shared area issue.
      
  • Note any corresponding downtime from GOCDB. (For US sites see also OIM).
    • Example:
      ATLAS 
      ----- 
      1.x Green box for RAL: Scheduled downtime 17 January 08:00 to 18 January 17:00. 
      https://next.gocdb.eu/portal/index.php?Page_Type=View_Object&object_id=24047&grid_id=0
      
  • Note any related ticket in GGUS. (For CMS see also CMS CIS Savannah).
    • Example:
      LHCb 
      ---- 
      4.x NIKHEF: CE-sft-vo-swdir test failing intermittently (Timeout after 150 seconds). 
      This is consistent with the 10% failure rate observed for MC jobs at NIKHEF setting up environment using CERNVMFS. 
      Related GGUS ticket: https://gus.fzk.de/ws/ticket_info.php?ticket=66287
      
  • Hints:

GOCDB entries

Site
ASGC
FZK-LCG2
IN2P3-CC
INFN-T1
NIKHEF-ELPROD
RAL-LCG2
PIC
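
The note format used in the examples above can be captured in a small helper, which keeps the daily notes consistent and ready to paste into the weekly report. This is only a minimal sketch: the names FailureNote, format_notes and VO_PREFIX are hypothetical, the prefixes follow the report numbering described in this document (0.x All VOs, 1.x ATLAS, 2.x ALICE, 3.x CMS, 4.x LHCb), and numbering the notes with a running counter per VO is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

# Prefixes from the report numbering convention in this document.
VO_PREFIX = {"All VOs": 0, "ATLAS": 1, "ALICE": 2, "CMS": 3, "LHCb": 4}


@dataclass
class FailureNote:
    """One observation from the daily check (hypothetical structure)."""
    vo: str
    site: str
    text: str
    # Optional GOCDB downtime or GGUS ticket URLs, one per line in the output.
    references: List[str] = field(default_factory=list)


def format_notes(notes):
    """Group notes by VO and number them prefix.1, prefix.2, ... per VO."""
    lines = []
    for vo, prefix in sorted(VO_PREFIX.items(), key=lambda item: item[1]):
        vo_notes = [note for note in notes if note.vo == vo]
        if not vo_notes:
            continue
        lines.append(vo)
        lines.append("-" * len(vo))
        for index, note in enumerate(vo_notes, start=1):
            lines.append(f"{prefix}.{index} {note.site}: {note.text}")
            lines.extend(note.references)
        lines.append("")  # blank line between VO sections
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_notes([
        FailureNote(
            "LHCb", "IN2P3",
            "CE-sft-vo-swdir test failing (Timeout after 150 seconds). "
            "Known shared area issue.",
        ),
    ]))
```

Notes for VOs that are not listed in VO_PREFIX are silently skipped in this sketch, so extend the mapping if other groupings are needed.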

Daily - Attend WLCG Operations Meetings

  • The VOs report issues of the previous day.
    • If you have found an issue that is not reported or covered by a downtime then you should mention it.
  • The Sites report issues of the previous day.
    • Listen out for any downtime or issue announcements that might affect SAM tests in the coming days.
  • The Services report issues of the previous day.
    • Listen out for any downtime or issue announcements that might affect SAM tests in the coming days.
  • Make any downtime or issues announcements on behalf of the Dashboard team.

Monday - Write The Reliability Report

  • NEW: As of mid-March 2012 we should compile Reliability reports (as opposed to the Availability reports compiled previously).
  • Extract the weekly quality plots for each VO.
    • ALLVO-SUM
    • File -> Save Page As... (Web Page, complete).
    • Copy png files from AllVOs_files/historicalsiteavailability_data* folders
  • Create report
    • Copy previous report, e.g. report-110117.ppt (available from previous SAM Analysis Reporter).
    • Insert new images into report (ATLAS, ALICE, CMS, LHCb)
    • Add numbers to any cell which is not green (0.x All VOs, 1.x ATLAS, 2.x ALICE, 3.x CMS, 4.x LHCb)
      • Whether a pale-green or yellow cell counts as non-green is a judgment call, so compare it with previous reports for consistency.
      • For purely monitoring issues, use a numbered green box to cover the affected cell.
    • Add daily report notes to page 2, referencing the numbers.
  • Send report before 14:00 CET.
    To: wlcg-scod@cern.ch, Julia Andreeva
    Subject: Reliability Report for the Week of 110117
    
    Dear all,
    
    Please find attached the reliability report for the week of 110117.
    
    Best regards,
    John Smith
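
The mechanical parts of the Monday steps above (copying the plot images and drafting the email) could be scripted roughly as follows. This is a sketch under stated assumptions, not an established tool: the function names are hypothetical, the YYMMDD label is assumed to be the Monday of the week before the day the report is written (consistent with "110117" for a report written on 24 January 2011), and the second recipient has to be added by hand because the address is not given here.

```python
import shutil
from datetime import date, timedelta
from email.message import EmailMessage
from pathlib import Path


def week_label(report_day, weeks_back=1):
    """Return the YYMMDD label of the Monday `weeks_back` weeks before report_day's week.

    Assumption: the report written on a Monday covers the previous week.
    """
    this_monday = report_day - timedelta(days=report_day.weekday())
    return (this_monday - timedelta(weeks=weeks_back)).strftime("%y%m%d")


def collect_plots(saved_page_dir, dest_dir):
    """Copy PNG files from the historicalsiteavailability_data* folders."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    pattern = "historicalsiteavailability_data*/*.png"
    for png in sorted(Path(saved_page_dir).glob(pattern)):
        # Prefix with the folder name so equally named files do not clash.
        target = dest / f"{png.parent.name}_{png.name}"
        shutil.copy2(png, target)
        copied.append(target)
    return copied


def draft_email(label, sender_name):
    """Build the report email from the template; attach the report separately."""
    msg = EmailMessage()
    msg["To"] = "wlcg-scod@cern.ch"  # add the second recipient manually
    msg["Subject"] = f"Reliability Report for the Week of {label}"
    msg.set_content(
        "Dear all,\n\n"
        f"Please find attached the reliability report for the week of {label}.\n\n"
        f"Best regards,\n{sender_name}\n"
    )
    return msg


if __name__ == "__main__":
    label = week_label(date.today())
    # "AllVOs_files" is the folder created by "Save Page As... (Web Page, complete)".
    print(collect_plots("AllVOs_files", f"plots-{label}"))
    print(draft_email(label, "John Smith"))
```

Inserting the images into the slides and numbering the non-green cells still have to be done by hand.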
    

SAM Visualization Gotchas

  • On the ALLVO-SAM page, the clickable areas do not line up with the cells, and you should open links in a new tab (since the overview page uses iframes).
  • Sometimes when you access a page for a given VO it reports for a different VO. (Just refresh the page until it reports for the desired VO).
  • The form on the “testhistory” view will not remember which instance you selected. (You have to reselect it if you want to resubmit the form).
  • When you drill down to the “testhistory” view from the “historicalserviceavailability” view, you will see all tests, not just critical tests. To see only critical tests, resubmit the form by clicking “Show Results”. (Make sure you have reselected the instance, see note above).
  • The above trick does not work on the CMS SAM visualization. In this case, you have to click on "Latest Results" and select "VO Critical" under "Service Types" to view the list of critical tests in the "Test Types" selection box.
  • If the monitoring in the SAM visualization looks suspicious, check the original (non-Dashboard) SAM interface: LCG-SAM

Tips for CMS tests

  • Andrea's suggestion: tests whose names start with CE-org.cms... are not relevant, since they come from Nagios.

Tips for ATLAS tests

  • To cross-check site availability against the Dashboard monitoring, the Grid View tool can be used (one example is a bad day at Taiwan-LCG2 as seen in the Dashboard monitoring).

-- DavidTuckett - 24-Jan-2011

Topic revision: r39 - 2013-09-10 - DavidTuckett