What is SAM for?

SAM jobs perform the following actions:

  • Check that job submission to the site batch system is working (i.e., if the SAM job runs, it's working).
  • Install the latest version of the LHCb software on the shared software area at each site.
  • Run test Gauss, Boole, Brunel and DaVinci applications to check that the software has been installed properly.

A failure in any one of these steps will lead to the SAM test failing. Grid Experts and Grid Shifters should investigate the reasons for failures and take the appropriate action.

How can SAM tests fail?

Failures can come from many sources:

  • There is a problem with the site batch system.
  • The LHCb SAM tests use too many site resources (e.g. CPU or memory) and are killed by the batch system. This is unlikely, but could happen.
  • The site has the wrong permissions on the shared software area, meaning that the software can't be installed, or the user is incorrectly mapped.
  • The software fails to install. This may be due to a bug in the installation script or to a problem with the site setup. For example, the site may be using an unusual filesystem such as GPFS for the shared software area, which causes the install to fail.
  • The application fails to complete. This could be due to the software not being installed properly (e.g. DaVinci.exe not found) or to a real bug in the application code (e.g. a ROOT seg-fault).

What action should be taken?

Depending on the failure, the Grid Expert or Grid Shifter will have to react differently. The following information should be used:

  • SAM results page
    • jobID and DIRACsetup
    • LogSE directory
    • DIRAC site name
  • WMS job information
    • Sandbox (std.out and std.err)
    • Job logging parameters
    • Job parameters

You should definitely look at using this tool to investigate the problem.

Grid Experts can submit a new SAM test

ProductionProceduresSamTestSubmit provides instructions for Grid Experts. They can target a site with a SAM job and clean up the shared area.

Site problem

A GGUS ticket should be submitted with a description of the failure. Mark it as Critical.

SAM test problem

If the software installer is failing, send an email to lhcb-grid@cern.ch and submit a bug in Savannah.

Application problem

If the software is misbehaving, find out which version of the application is affected and report the problem to lhcb-grid@cern.ch. The software may have to be reinstalled.
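
The three cases above can be summarised as a small lookup. This is only an illustrative sketch: the function name recommended_action and the class labels site, installer and application are hypothetical and do not correspond to any DIRAC or SAM tool.

```shell
#!/bin/sh
# Sketch: map a failure class to the follow-up described on this page.
# The class labels (site, installer, application) are hypothetical names
# chosen for this example only.
recommended_action() {
    case "$1" in
        site)
            echo "Submit a GGUS ticket describing the failure and mark it as Critical." ;;
        installer)
            echo "Email lhcb-grid and submit a bug in Savannah." ;;
        application)
            echo "Find the application version and report it to lhcb-grid; the software may need reinstalling." ;;
        *)
            # Anything else is not covered by this page.
            echo "Unknown failure class: $1" >&2
            return 1 ;;
    esac
}
```

For example, calling recommended_action installer prints the reminder to email lhcb-grid and open a Savannah bug.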

Should I ban the sites?

If it is clear that the problem will cause all subsequent jobs to fail, you should ban the site. First check whether the site is already banned and, if so, what the reason for the ban was.

$ dirac-admin-get-banned-sites

If it is not already banned, do this:

$ dirac-admin-ban-site LCG.RHUL.uk --comment='wrong permissions on shared software area'

If the problem is with the application, use the error message from the application log file as the comment.

$ dirac-admin-ban-site LCG.RHUL.uk --comment='Brunel.exe not found'
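
The check-then-ban sequence above could be wrapped in a small helper along the following lines. This is a sketch under assumptions: the output format of dirac-admin-get-banned-sites is not described on this page, so the helper only looks for the site name anywhere in that output. The function names site_in_list and ban_if_needed and the GET_BANNED_CMD and BAN_CMD variables are hypothetical, introduced so the logic can be tried without a DIRAC installation.

```shell
#!/bin/sh
# Hypothetical override hooks; by default the two commands from this page are used.
GET_BANNED_CMD="${GET_BANNED_CMD:-dirac-admin-get-banned-sites}"
BAN_CMD="${BAN_CMD:-dirac-admin-ban-site}"

# Succeeds if the site name ($1) appears as a whole word in the text on stdin.
# Assumption: the banned-site listing contains the plain site name. This is a
# loose textual match, not a parser for the real output format.
site_in_list() {
    grep -qwF -- "$1"
}

# Ban site $1 with comment $2 unless it already appears in the banned list.
ban_if_needed() {
    if "$GET_BANNED_CMD" | site_in_list "$1"; then
        echo "$1 is already banned; check the reason for the existing ban"
    else
        "$BAN_CMD" "$1" --comment="$2"
    fi
}

# Example invocation (not run here):
# ban_if_needed LCG.RHUL.uk 'wrong permissions on shared software area'
```

The helper deliberately does nothing when the site is already listed, so the existing ban reason can be reviewed by hand first.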

!!!Remember to UNBAN a site once the problem is resolved!!!

-- GreigCowan - 21 Sep 2008


Topic revision: r1 - 2008-09-21 - GreigCowan