SAM Overview

Introduction

SAM (Service Availability Monitoring) is a framework for the monitoring of production and pre-production grid sites. It provides a set of probes which are submitted at regular intervals, and a database that stores test results. In effect, SAM provides monitoring of grid services from a user perspective.

SAM components

The SAM testing and monitoring framework is made up of several components:

lcg-sam-jobwrapper - SAM jobwrapper
lcg-sam-client - SAM Client
lcg-sam-client-WNconfig-EGEE - SAM Client configuration for EGEE production grid with publishing to production DB
lcg-sam-client-WNconfig-EGEEval - SAM Client configuration for EGEE production grid with publishing to validation DB
lcg-sam-client-sensors - SAM Client Sensors
lcg-sam-server-db - Service Availability Monitoring Database
lcg-sam-server-portal - Service Availability Monitoring Web Portal
lcg-sam-server-ws - SAM query web service
lcg-sam-server-xsql - XDK/XSQL based web interface to SAM DB

The SAM database

Structure of the database

SAM uses an Oracle database to store the test definitions, node and site information, and the test results. The database contains 248 different user objects (in the Oracle sense); only a simplified view of its structure is described here, and a sample query joining the main tables is sketched after the list:

  • The service table defines a list of services which are 'monitorable' by SAM:
    SQL> describe service
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     SERVICEID                                 NOT NULL NUMBER(38)
     SERVICESCOPE                              NOT NULL CHAR(1)
     SERVICEABBR                               NOT NULL VARCHAR2(8)
     SERVICENAME                               NOT NULL VARCHAR2(100) 
    
  • The nodes table defines the nodes to be monitored:
    SQL> describe node;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     NODEID                                    NOT NULL NUMBER(38)
     SITEID                                    NOT NULL NUMBER(38)
     TIMESTAMP                                 NOT NULL NUMBER(38)
     ISDELETED                                 NOT NULL CHAR(1)
     NODENAME                                  NOT NULL VARCHAR2(255)
    

  • The serviceinstance table defines an instance of a service:
    SQL> describe serviceinstance;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     SERVICEID                                 NOT NULL NUMBER(38)
     NODEID                                    NOT NULL NUMBER(38)
     TIMESTAMP                                 NOT NULL NUMBER(38)
     ISMONITORED                               NOT NULL CHAR(1)
     ISDELETED                                 NOT NULL CHAR(1)
    
  • The servicevo table defines the VOs supported by a service instance:
    SQL> describe servicevo;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     SERVICEID                                 NOT NULL NUMBER(38)
     NODEID                                    NOT NULL NUMBER(38)
     VOID                                      NOT NULL NUMBER(38)
     ISCORE                                    NOT NULL CHAR(1)
    
  • The TestEnv table stores the definition of the environment in which a test has been run:
    SQL> describe testenv;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     ENVID                                     NOT NULL NUMBER(38)
     TIMESTAMP                                 NOT NULL NUMBER(38)
     ENVNAME                                   NOT NULL VARCHAR2(100)
    
    
  • The TestDef table describes the tests:
    SQL> describe testdef;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     TESTID                                    NOT NULL NUMBER(38)
     SERVICEID                                 NOT NULL NUMBER(38)
     CREATEDTIMESTAMP                          NOT NULL NUMBER(38)
     TESTNAME                                  NOT NULL VARCHAR2(100)
     TESTTITLE                                          VARCHAR2(255)
     TESTABBR                                           VARCHAR2(50)
     DATATYPE                                           VARCHAR2(64)
     DATAUNIT                                           VARCHAR2(30)
     DATATHRESHOLD                                      VARCHAR2(64)
     TESTHELP                                           CLOB
    
  • The TestData table contains the results of the tests:
    SQL> describe TestData
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     VOID                                      NOT NULL NUMBER(38)
     TESTID                                    NOT NULL NUMBER(38)
     NODEID                                    NOT NULL NUMBER(38)
     ENVID                                     NOT NULL NUMBER(38)
     STATUSID                                  NOT NULL NUMBER(38)
     TIMESTAMP                                 NOT NULL NUMBER(38)
     SUMMARYDATA                                        VARCHAR2(255)
     DETAILEDDATA                                       CLOB
    

  • The TestCriticality table defines which tests are critical for which VO:
    SQL> describe testcriticality;
     Name                                      Null?    Type
     ----------------------------------------- -------- ----------------------------
     VOID                                      NOT NULL NUMBER(38)
     TESTVOID                                  NOT NULL NUMBER(38)
     TESTID                                    NOT NULL NUMBER(38)
     MAXAGE                                    NOT NULL NUMBER(38)
     ISCRITICAL                                NOT NULL CHAR(1)
    

  • The region table defines the regions:
    SQL> describe region
     Name                                      Null?    Type
     ----------------------------------------- -------- ---------------------------- 
     REGIONID                                  NOT NULL NUMBER(38)
     REGIONNAME                                NOT NULL VARCHAR2(50)
     REGIONDESCRIPTION                                  VARCHAR2(255)
    

  • The site table defines the sites:
    SQL> describe site;
     Name                                      Null?    Type
     ----------------------------------------- -------- ---------------------------- 
     SITEID                                    NOT NULL NUMBER(38)
     REGIONID                                  NOT NULL NUMBER(38)
     COUNTRYID                                 NOT NULL NUMBER(38)
     TIER                                      NOT NULL NUMBER(38)
     ISDELETED                                 NOT NULL CHAR(1)
     STATUS                                    NOT NULL VARCHAR2(20)
     TYPE                                      NOT NULL VARCHAR2(20)
     SITENAME                                  NOT NULL VARCHAR2(100)
     SYSADMINCONTACT                           NOT NULL VARCHAR2(255)
     SITEDESCRIPTION                                    VARCHAR2(255)
    
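For orientation, the following sketch shows how these tables fit together: it lists the ten most recent test results by joining testdata, testdef and node. The connect string (sam_reader@samdb) is only a placeholder for a local read-only database account; sqlplus will prompt for the password.

sqlplus -S sam_reader@samdb <<'EOF'
-- Ten most recent test results (TIMESTAMP is stored as a number).
SELECT * FROM (
  SELECT n.nodename, t.testname, d.summarydata
    FROM testdata d, testdef t, node n
   WHERE d.testid = t.testid
     AND d.nodeid = n.nodeid
   ORDER BY d.timestamp DESC
) WHERE ROWNUM <= 10;
EOF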

Populating the database

The database structure is created during the installation. After that, in principle, there are two ways to populate the database:
  • Using the /opt/lcg/same/server/db/cron/bdii2oracle.py script which reads the BDII, translates and inserts the information into the Oracle database.
  • Inserting information by hand, for example:
    insert into serviceinstance values (14,19,1155739159,'Y','N');
    commit;
    
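The first method is normally driven from cron. The sketch below is illustrative only: the script is assumed to take no arguments, and how it authenticates to Oracle is not covered here; the hourly schedule in the comment is also an assumption.

# Run the BDII -> Oracle synchronisation by hand:
/opt/lcg/same/server/db/cron/bdii2oracle.py
# A corresponding /etc/cron.d entry could look like:
#   0 * * * * root /opt/lcg/same/server/db/cron/bdii2oracle.py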

The jobwrapper

SFT (Site Functional Tests) results did not always correspond to the real site status for all VOs. Therefore, more accurate monitoring was required.

The basic idea is that a script called from the CE job wrapper triggers the validation of the WN on which the job runs. This provides some information for every job and quickly extends test coverage to all VOs and all WNs. The advantage of hooking into the CE wrapper is that all jobs are monitored, independently of how they reached the site (it is the outermost wrapper). After the validation has run, the overall result is communicated via the environment and the details are provided to the user program via a file; this can result in rule-based termination of the user program. Validation tests should be split into two stages so that tests can be versioned quickly. Test results are also published to R-GMA and the SAM databases so that they can be used by the monitoring tools.

The jobwrapper is installed under the /opt/lcg/libexec/jobwrapper directory. It performs the following steps:

  1. Runs or sources the scripts in /opt/lcg/etc/jobwrapper-start.d/
  2. Starts the original executable
  3. Runs or sources the scripts in /opt/lcg/etc/jobwrapper-end.d/

The scripts in the above directories are:

  • run - if they are executable, or
  • sourced - if they are readable (but not executable) - this can be used to change the job's environment.

Other tools can put scripts in these directories to have them executed before or after the job; an example drop-in script is sketched below.
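For illustration, a hypothetical drop-in script for /opt/lcg/etc/jobwrapper-start.d/ might look like the following sketch (the file name and log location are made up). Because the file is executable it is run; removing the execute bit would make the jobwrapper source it instead, so that it could modify the job's environment.

#!/bin/sh
# /opt/lcg/etc/jobwrapper-start.d/50-trace (hypothetical example)
# Record when and where the job wrapper started.
echo "jobwrapper-start: $(date -u) on $(hostname)" >> /tmp/jobwrapper-trace.log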

The query web service

The query web service runs at:
publisher_wsdl=http://<hostname>:8080/same-ws/services/WebArchiver?wsdl
query_wsdl=http://<hostname>:8080/same-ws/services/Database?wsdl
and its job is to mediate the communication between the clients and the database.
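A quick way to check that the web service answers is to fetch its WSDL (replace <hostname> with the SAM server host):

curl -s "http://<hostname>:8080/same-ws/services/Database?wsdl" | head -5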

The web portal

The web portal is a Python-based web application for displaying and querying the test results. Only users with valid certificates can view the pages; in addition, access is granted based on IP address. Several configuration and customization options are available, and instead of cookies the portal remembers a user's settings based on their certificate subject DN.

NOTE: The portal displays the results with their timestamps shown in GMT winter time! Forgetting this could cause confusion.

The client part

The client binaries

The client binaries can be found under /opt/lcg/same/client/bin. The most important are the following ones:

  • same-exec - executes sensors
  • same-publish-tuples - publishes test results
  • same-query - queries the database via the web service
  • same-run-test - executes individual tests

The client configuration

The configuration file of the SAM client is /opt/lcg/same/client/etc/same.conf. Here you can define the default sensor filters, the SAM exit codes, the server endpoint URLs, log levels, timeouts, VOs, etc. The format of the file is self-explanatory and is not detailed here.

The sensors and tests

Structure of the sensors directory

The client sensors are located in the /opt/lcg/same/client/sensors directory. For each sensor (service) there has to be a directory with the same name. The structure of this directory is the following:
/opt/lcg/same/client/sensors/
                             CE/
                             gRB/
                             SE/
                                check-SE
                                prepare-SE
                                test-sequence.lst
                                tests/
                                      SE-lcg-cp
                                      SE-lcg-cp.def
                             FTS/
                             gCE/
                             .
                             .
                             .
A sensor (i.e. a directory under /opt/lcg/same/client/sensors/) tests one service. A sensor can have several tests. A test is an executable file of any kind (usually a bash, perl or python script) which has:
  • HTML output
  • SAM-defined exit values
  • a set of input parameters given to it by the SAM framework

There are no other restrictions; a minimal example is sketched below.
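As an illustration only, a minimal test (here a hypothetical SE-hostname-echo) could look like the following sketch: it prints HTML to standard output and finishes with one of the SAM exit values listed further down.

#!/bin/sh
# Hypothetical minimal SAM test: report where it ran, as HTML, and exit OK.
echo "<h2>SE-hostname-echo</h2>"
echo "<p>Running on $(hostname) for VO ${SAME_VO}</p>"
# 10 corresponds to ok (SAME_OK) in the exit-value table below.
exit 10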

Each test name has to have the form <SENSOR_NAME>-<TEST_NAME>, and each test has to have an associated .def file which defines it. In its minimal form the .def file contains the following information:

testName: SE-lcg-cp
testAbbr: cp
testTitle: Copy a file back from the SE
This information is published automatically when a test is submitted, so there is no need to touch the database every time a new test is added. Only new sensors have to be added by hand.

Executing a test

What happens when a sensor is launched?
  1. The SAM client connects to the SAM query web service and retrieves the list of nodes that pass the filter belonging to the sensor. For each sensor, a default filter can be defined in /opt/lcg/same/client/etc/same.conf or given on the command line.
  2. Publishes the test definitions.
  3. Goes to the sensor's directory and executes the prepare- script.
  4. For each node in the list, executes the check- script with 3 parameters (sitename, nodename, inMaintenance) and puts it into the background. The check- script (a conceptual sketch follows this list):
    1. creates the working directories
    2. executes the tests in the order defined in test-sequence.lst, passing the environment and input parameters
    3. publishes the results to the database via the web service
  5. For each node a directory named after the node is created under $SAME_SENSOR_WORK/nodes, and all the results of its tests go there.
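Conceptually, the core of a check- script is a loop over test-sequence.lst. The sketch below is only illustrative: the real check- scripts shipped with the sensors also create the working directories and publish the results through the web service, and the exact parameters passed to each test are sensor-specific.

#!/bin/sh
# Illustrative only: run every test listed in test-sequence.lst in order.
sitename=$1; nodename=$2; inMaintenance=$3
while read testname; do
    "$SAME_SENSOR_HOME/tests/$testname" "$nodename"   # parameter passing is an assumption
done < "$SAME_SENSOR_HOME/test-sequence.lst"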

A test can run on the node from where it was submitted, or it can submit a job. The same-exec command is used to manage the tests and check their status:

  • same-exec --submit SE - submit a sensor (default operation)
  • same-exec --status SE - query the status of the tests
  • same-exec --cancel SE - cancel the tests
  • same-exec --publish SE - publish the results of the tests

These options have to be explicitly handled in the check- scripts and in the tests.

Exit values of tests

The test exit codes are defined in the same.conf configuration file and have to be one of the following:
ok          =10 (SAME_OK)
info        =20 (SAME_INFO)
notice      =30 (SAME_NOTICE)
warning     =40 (SAME_WARNING)
error       =50 (SAME_ERROR)
critical    =60 (SAME_CRITICAL)
maintenance =100 (SAME_MAINTENANCE)
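A test communicates its result purely through its exit code; a caller that wants a human-readable status can map the code back to its name, for example with a small helper like this sketch:

#!/bin/sh
# Map a SAM exit value (first argument) to its symbolic name.
case "$1" in
    10)  echo ok ;;
    20)  echo info ;;
    30)  echo notice ;;
    40)  echo warning ;;
    50)  echo error ;;
    60)  echo critical ;;
    100) echo maintenance ;;
    *)   echo "unknown ($1)" ;;
esac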

Result of the test

The overall status of a service is a function of its critical tests, i.e. it is equal to the status of the worst critical test. The criticality of a test can depend on the VO, so the same service can have different statuses for different VOs. The critical tests are defined in the TestCriticality table.
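For example, the tests that count towards a given VO's view of a service can be read directly from this table. The query below is only illustrative: the connect string and the numeric VO id are placeholders, and it assumes the usual 'Y'/'N' convention for the ISCRITICAL flag.

sqlplus -S sam_reader@samdb <<'EOF'
-- Tests flagged as critical for the VO with id 14 (placeholder id).
SELECT t.testname, c.maxage
  FROM testcriticality c, testdef t
 WHERE c.testid = t.testid
   AND c.void   = 14
   AND c.iscritical = 'Y';
EOF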

SAM environment variables

The SAM framework sets some environment variables and passes them to the tests (see the sketch after this list).
  • SAME_SENSOR_HOME - points to the directory of the sensor the test belongs to, e.g. /opt/lcg/same/client/sensors/SE
  • SAME_SENSOR_WORK - points to a place which can be used by the tests to store temporary files and their results.
  • SAME_VO - defines the VO under which the test is actually run.
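For example, a test can use these variables to keep its scratch files under the sensor's work area and to adapt its behaviour to the VO. The fragment below is only a sketch; the file names and the dteam VO are just examples.

#!/bin/sh
# Hypothetical fragment of a test using the SAM environment variables.
workdir="${SAME_SENSOR_WORK}/scratch"
mkdir -p "$workdir"
echo "sensor home: ${SAME_SENSOR_HOME}" > "$workdir/debug.txt"
if [ "${SAME_VO}" = "dteam" ]; then
    echo "<p>Running the dteam-specific checks</p>"
fi
exit 10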

Submission examples

  • Simple submission of SE test:
    /opt/lcg/same/client/bin/same-exec SE
    
  • Directed submission of SE test to a given node:
    /opt/lcg/same/client/bin/same-exec SE nodename=mySE.cern.ch
    
  • Publishing the results of the CE test:
    /opt/lcg/same/client/bin/same-exec --publish CE
    
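Putting these together, a possible cycle for a sensor whose tests submit grid jobs, using the same-exec options listed earlier (whether a given sensor supports the full cycle depends on its check- script):

/opt/lcg/same/client/bin/same-exec SE              # submit the sensor
/opt/lcg/same/client/bin/same-exec --status SE     # check the status of the tests
/opt/lcg/same/client/bin/same-exec --publish SE    # publish the results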

Further reading

Here you can find a collection of links, documentation and e-mail lists which could be useful when working with SAM.

Documentation

  1. Service Availability Monitor (SAM) Documentation - Installation Guide
  2. Adding new tests and publishing results with SAM
  3. SAM installation guide in Pre-Production
  4. Service Availability Monitoring in EGI InSPIRE

Feedback

  1. SAM Savannah - for bug submission

E-mail lists

  1. SAM announcements - same-announce@cern.ch
  2. SAM Development - same-devel@cern.ch
  3. SAM Support - same-grid-support@cern.ch

Download

  1. The SAM APT repo
  2. The APT string: "rpm http://grid-deployment.web.cern.ch/grid-deployment/gis/SAM slc3 SAM"
  3. The SAM CVS
  4. CVS command line settings for the official SAM tests
    export CVSROOT=:pserver:anonymous@jra1mw.cvs.cern.ch:/cvs/jra1mw               
    export CVS_RSH=ssh
    cvs co same
    
  5. CVS command line settings for certification SAM tests
    export CVSROOT=:pserver:anonymous@jra1mw.cvs.cern.ch:/cvs/jra1mw               
    export CVS_RSH=ssh
    cvs co org.glite.testsuites.ctb
    make rpm
    

SAM instances

  1. Production SAM portal
  2. SAM portal for Pre-Production - Certified sites
  3. SAM portal for Pre-Production - Uncertified sites
  4. SAM portal for Certification

Troubleshooting

  1. The LCG directory
  2. The LCG Troubleshooting Guide

Standalone SAM installation

  1. How to install standalone SAM

Gergely Debreczeni
