
Checklist for SAM tests

In the following, a procedure to check the SAM tests on a collection of CMS sites is described, together with the actions that have to be taken in case of failures.

Checklist

Preliminary steps

  1. Look up the details of the sites you want to check in SiteDB.
  2. Go to the CMS SAM dashboard page.
    • NOTE: May not display properly on browsers other than IE and Firefox
  3. Click on "Summary View of latest Results".

CE tests

SAM-1.jpg

  1. In "Sites", select the desired site(s)
    • Note: the site names are the Tier names according to the new convention, namely T{0,1,2,3}_[region]_[sitename]
    • Note: as usual, use the CTRL or SHIFT keys for multiple selections
  2. In "Service Types", select "VO critical" and "CE".
  3. In "Test Types", select "CMS Tests" and "Select all".
  4. In "Test Exit Status", select "All Exit Status".
  5. Click on "Show Results".
  6. For each CE instance, do the following:
    1. check the colour of the tests (a self-explanatory colour-code legend is displayed on the page);
    2. check the age of the tests
      • dark colours mean the test is recent, lighter colours mean the test is older; hovering the mouse over a cell shows the execution time of the latest test
    3. check the exit status of the tests. If a test is in status warn or error, click on it, examine the test output, and try to understand why it failed;
    4. in case of error, see if it could be explained by a downtime;
    5. click on the host name of the CE to see the history of the last 48 hours; use this historical view to investigate failures preceding the latest test results, if you have not already done so;
    6. notify the site of any error in a CE, and of any warnings that deserve the attention of the site admins;
    7. notify the SAM squad if a CE test has not been run during the last 24 hours, if you think that a test failure is not really site-related, or if you have no idea what the reason is.
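
The site naming convention noted in step 1, T{0,1,2,3}_[region]_[sitename], can be checked mechanically before selecting sites. The following is only a sketch: the characters allowed in the region and site name parts are an assumption (this page does not specify them), and the site names used with it are illustrative.

```python
import re

# Assumed pattern for T{0,1,2,3}_[region]_[sitename]. The character sets
# for region and sitename are a guess, not taken from this page.
SITE_NAME = re.compile(r"^T([0-3])_([A-Za-z]+)_([A-Za-z0-9_-]+)$")


def parse_site_name(name):
    """Return (tier, region, sitename), or None if the name does not match."""
    match = SITE_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)
```

A name that returns None probably follows the old convention and will not appear in the "Sites" list.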

SRM(v1) tests

SAM-2.jpg

Repeat the procedure for the CE tests, but select "SRM" instead of "CE" in "Service Types".

SRMv2 tests

SAM-3.jpg

  1. In "Sites", select the desired sites.
  2. In "Service Types", select "All Service Types" and "SRMv2" (not "SRM2"!).
  3. In "Test Types", select "All Tests" and these tests: SRMv2-get-pfn-from-tfc, SRMv2-lcg-cp, SRMv2-lcg-gt-rm-gt, SRMv2-lcg-gt, SRMv2-lcg-ls-dir and SRMv2-lcg-ls.
    • NOTE: currently the Dashboard does not show the test names correctly in the column headers of the table
  4. In "Test Exit Status", select "All Exit Status".
  5. Click on "Show Results".
  6. From this point on, proceed as for the SRM tests.

Procedures

Add people to the "CMS Computing Infrastructure Support" Savannah group

This section explains how to add a person to the "CMS Computing Infrastructure Support" Savannah group. NOTE: This information is relevant to SAM shifters only, and only Savannah group admins have the privileges to manage members.
  1. Check that the person is indeed missing in the Savannah project: view the memberlist
  2. Go to the main page of the Savannah project, use the "Search" feature on the left-hand side to look for the person: type the surname, select "in People", click "Search"
    • If the person does not have a Savannah id, send a mail to this person and ask him/her to create one here
    • If the person has a [savannah-id] in any Savannah project, a page like https://savannah.cern.ch/users/[savannah-id] will be opened: note his/her [savannah-id]
  3. Go to the useradmin section, scroll to the bottom, type the [savannah-id] of the person in the "Adding users to group" interface, click "Search users", select the row found in the search, click "add users to group"
  4. Verify the person has been properly added by checking the memberlist
Once properly done, his/her [savannah-id] will be shown in the "Assign to" pop-up window when you open a Savannah ticket, so now you can assign tickets to this person within the "CMS Computing Infrastructure Support" Savannah project.

Interpret SAM tests

The outcome of SAM tests should be interpreted based on this information:
  • the test documentation, accessible by clicking on the test name in the column header in the SAM page, or from here;
  • the troubleshooting guide.

Intermittent failures

If a test fails only intermittently, report it only if the failure rate is above approximately 10%.
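
As an illustration of the 10% rule, the failure rate can be computed from the list of statuses seen in the test history. This is a minimal sketch: it assumes that only the status "error" counts as a failure and that a passing test is labelled "ok"; neither is stated on this page. The threshold is approximate, so treat the result as a hint rather than a hard limit.

```python
REPORT_THRESHOLD = 0.10  # "approximately 10%" from the text above


def failure_rate(statuses):
    """Fraction of test runs whose status is 'error' (assumed failure label)."""
    if not statuses:
        return 0.0
    failures = sum(1 for status in statuses if status == "error")
    return failures / len(statuses)


def should_report(statuses):
    """True if the failure rate is above the approximate threshold."""
    return failure_rate(statuses) > REPORT_THRESHOLD
```

For example, 2 errors in 10 runs gives a rate of 0.2 and would be reported, while 1 error in 20 runs would not.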

Use the historical summary view

It is possible to get the test history for any service instance within any given time interval, with interactive access to previous test results. This is the procedure:
  1. Click on "Historical Summary View"
  2. In "View", select "Test History"
  3. In "Time Range", select the desired time range
  4. In "Sites", select the desired sites
  5. In "Service Types", select the appropriate service type
  6. In "Services", select the desired instance
  7. In "Tests", select the desired tests
  8. Click on "Show Results"
In the plot produced, you can click on any rectangle to see the output of the last test executed at that point in time.

Notify a site

If you think that a SAM test failure is clearly due to a site problem, like in these cases
  • the CMS software installation area is unreachable
  • local site configuration and TFC are incomplete or wrong
  • the local storage does not work
  • the CE is not able to run the SAM jobs
you should submit a Savannah ticket in the category SAM tests and assign it to:
  1. a specific person, if you think you know who should be informed;
  2. a specific person listed in the SiteDB as Site Admin;
  3. cmscompinfrasup-facilities, if you do not know to whom you should assign the ticket.
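
The order of preference for the assignee can be written as a simple fallback chain. The function below is a hypothetical helper, not part of Savannah or SiteDB; taking the first listed Site Admin is an assumption made for the sake of the example.

```python
FALLBACK_ASSIGNEE = "cmscompinfrasup-facilities"


def choose_assignee(known_contact=None, site_admins=()):
    """Pick the ticket assignee following the order of preference above."""
    if known_contact:
        # 1. a specific person you know should be informed
        return known_contact
    if site_admins:
        # 2. a person listed in SiteDB as Site Admin (first entry: assumption)
        return site_admins[0]
    # 3. nobody known: fall back to the facilities group
    return FALLBACK_ASSIGNEE
```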

GGUS tickets

If the problem is clearly related to the site infrastructure, for EGEE sites it is recommended to create a GGUS ticket and reference it in the Savannah ticket; this may speed up the resolution of the problem.

Notify the SAM team

If you think that a SAM test problem is due to a reason independent of the site itself, for example (but not only):
  • the test is not being run anymore
  • the test code has bugs or it does not faithfully represent the real status of the service
  • a test job aborts for proxy expiration
  • or you just do not have a clue about the reason
you should submit a Savannah ticket in the category SAM tests and assign it to cmscompinfrasup-sam. The more precisely you describe your observations and checks, the faster the response will be.
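
The decision of whether a failing or missing test should go to the SAM team can be sketched as follows. It combines the criteria above with the 24-hour rule from the CE checklist. The function name and the three-valued `site_related` argument are assumptions made for this example.

```python
from datetime import datetime, timedelta

MAX_TEST_AGE = timedelta(hours=24)  # from the CE checklist


def needs_sam_team(last_run, now, site_related):
    """Decide whether a problem should be assigned to cmscompinfrasup-sam.

    site_related is True, False, or None when the cause is unknown.
    """
    if now - last_run > MAX_TEST_AGE:
        # the test is not being run any more
        return True
    # not site-related, or no idea what the reason is
    return site_related is not True
```

When the function returns False, the problem is site-related and the site should be notified instead.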

Check if a site is in downtime

For EGEE sites it is possible to check if a site is in a downtime. There are two ways to do it:
  • check the status of the js and jsprod tests for the CE. The status is maint when the site is in a downtime;
  • go to the GOCDB site, search for the site you are checking using its BDII name, and scroll down to the EGEE downtimes to see if the site is currently down.
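
The first check can be expressed in a few lines, given the latest status of each test on a CE. This is a sketch only: the dictionary layout is hypothetical, and treating a maint status on either js or jsprod as sufficient is an assumption.

```python
DOWNTIME_TESTS = ("js", "jsprod")


def ce_in_downtime(test_status):
    """test_status maps a test name to its latest status string for one CE."""
    # Assumption: "maint" on either test is enough to indicate a downtime.
    return any(test_status.get(test) == "maint" for test in DOWNTIME_TESTS)
```

If the result is True, an error in the other CE tests can most likely be explained by the downtime; the GOCDB page remains the place to confirm it.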

Find out the BDII name of a site

If you know the CMS name of a site but not its BDII name (also called the SAM name; they are the same thing), use this SiteDB report.

To do

Every single site, which fails the SAM tests, is now an enemy of CMS.

Chancellor Palpatine

-- Main.AndreaSciaba - 17 Mar 2008 - Mods: DB, 17/3 - AS/DB 18/3 - DB 25/3

-- Main.AleDiGGi - 13 May 2008

Topic revision: r2 - 2010-11-11 - MatthiasStein
 