SAM Overview
Introduction
SAM (Service Availability Monitoring) is a framework for monitoring production and pre-production grid sites. It provides a set of probes that are submitted at regular intervals, and a database that stores the test results. In effect, SAM provides monitoring of grid services from a user perspective.
SAM components
The SAM testing and monitoring framework is made up of several components:
- lcg-sam-jobwrapper - SAM job wrapper
- lcg-sam-client - SAM client
- lcg-sam-client-WNconfig-EGEE - SAM client configuration for the EGEE production grid, publishing to the production DB
- lcg-sam-client-WNconfig-EGEEval - SAM client configuration for the EGEE production grid, publishing to the validation DB
- lcg-sam-client-sensors - SAM client sensors
- lcg-sam-server-db - Service Availability Monitoring database
- lcg-sam-server-portal - Service Availability Monitoring web portal
- lcg-sam-server-ws - SAM query web service
- lcg-sam-server-xsql - XDK/XSQL-based web interface to the SAM DB
The SAM database
Structure of the database
SAM uses an Oracle database to store the test definitions, node and site information, and the test results. The database contains 248 different user objects (in the Oracle sense), but only a simplified structure of the SAM database is described here:
- The service table defines the list of services which are 'monitorable' by SAM:
SQL> describe service
Name Null? Type
----------------------------------------- -------- ----------------------------
SERVICEID NOT NULL NUMBER(38)
SERVICESCOPE NOT NULL CHAR(1)
SERVICEABBR NOT NULL VARCHAR2(8)
SERVICENAME NOT NULL VARCHAR2(100)
- The node table defines the nodes to be monitored:
SQL> describe node;
Name Null? Type
----------------------------------------- -------- ----------------------------
NODEID NOT NULL NUMBER(38)
SITEID NOT NULL NUMBER(38)
TIMESTAMP NOT NULL NUMBER(38)
ISDELETED NOT NULL CHAR(1)
NODENAME NOT NULL VARCHAR2(255)
- The serviceinstance table defines an instance of a service:
SQL> describe serviceinstance;
Name Null? Type
----------------------------------------- -------- ----------------------------
SERVICEID NOT NULL NUMBER(38)
NODEID NOT NULL NUMBER(38)
TIMESTAMP NOT NULL NUMBER(38)
ISMONITORED NOT NULL CHAR(1)
ISDELETED NOT NULL CHAR(1)
- The servicevo table defines the VOs supported by a service instance:
SQL> describe servicevo;
Name Null? Type
----------------------------------------- -------- ----------------------------
SERVICEID NOT NULL NUMBER(38)
NODEID NOT NULL NUMBER(38)
VOID NOT NULL NUMBER(38)
ISCORE NOT NULL CHAR(1)
- The TestEnv table stores the definition of the environment in which a test has been run:
SQL> describe testenv;
Name Null? Type
----------------------------------------- -------- ----------------------------
ENVID NOT NULL NUMBER(38)
TIMESTAMP NOT NULL NUMBER(38)
ENVNAME NOT NULL VARCHAR2(100)
- The TestDef table describes the tests:
SQL> describe testdef;
Name Null? Type
----------------------------------------- -------- ----------------------------
TESTID NOT NULL NUMBER(38)
SERVICEID NOT NULL NUMBER(38)
CREATEDTIMESTAMP NOT NULL NUMBER(38)
TESTNAME NOT NULL VARCHAR2(100)
TESTTITLE VARCHAR2(255)
TESTABBR VARCHAR2(50)
DATATYPE VARCHAR2(64)
DATAUNIT VARCHAR2(30)
DATATHRESHOLD VARCHAR2(64)
TESTHELP CLOB
- The TestData table contains the results of the tests:
SQL> describe TestData
Name Null? Type
----------------------------------------- -------- ----------------------------
VOID NOT NULL NUMBER(38)
TESTID NOT NULL NUMBER(38)
NODEID NOT NULL NUMBER(38)
ENVID NOT NULL NUMBER(38)
STATUSID NOT NULL NUMBER(38)
TIMESTAMP NOT NULL NUMBER(38)
SUMMARYDATA VARCHAR2(255)
DETAILEDDATA CLOB
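To illustrate how these tables fit together, the following sketch lists every non-deleted service instance with its service abbreviation and node name. It assumes sqlplus access to the SAM database and uses only the columns shown above; the same_user/same_pass credentials are placeholders:

sqlplus -s same_user/same_pass <<EOF
-- list non-deleted service instances with their service and node
select s.serviceabbr, n.nodename, si.ismonitored
  from service s, serviceinstance si, node n
 where si.serviceid = s.serviceid
   and si.nodeid    = n.nodeid
   and si.isdeleted = 'N';
EOF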
Filling up the database
The database structure is created during the installation. After that, there are in principle two ways to populate the database:
- Using the /opt/lcg/same/server/db/cron/bdii2oracle.py script, which reads the BDII, translates the information and inserts it into the Oracle database.
- Inserting information by hand, for example:
insert into serviceinstance values (14, 19, 1155739159, 'Y', 'N');
commit;
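Such a hand insert can be verified with a query along the same lines (again a sketch, with placeholder credentials):

sqlplus -s same_user/same_pass <<EOF
select * from serviceinstance where serviceid = 14 and nodeid = 19;
EOF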
The jobwrapper
SFT results did not always correspond to the real status of a site for all VOs, so more accurate monitoring was required.
The basic idea is that a script called from the CE job wrapper triggers the validation of the WN on which the job runs. This provides some information for every job and quickly extends test coverage to all VOs and all WNs. The advantage of the CE wrapper is that all jobs are monitored, independently of how they reached the site (it is the outermost wrapper). After the validation has run, the overall result is communicated via the environment, and details are provided to the user program via a file. This can result in rule-based termination of the user program. Validation tests should be split into two stages so that tests can be versioned quickly. Test results are also published to R-GMA and the SAM databases in order to be used by the monitoring tools.
The jobwrapper is installed under the /opt/lcg/libexec/jobwrapper directory. It does the following:
- runs or sources the scripts in /opt/lcg/etc/jobwrapper-start.d/
- starts the original executable
- runs or sources the scripts in /opt/lcg/etc/jobwrapper-end.d/
The scripts in the above directories are:
- run, if they are executable, or
- sourced, if they are readable (but not executable); this can be used to change the job's environment.
Other tools can put scripts in these directories to have them executed before or after the job.
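For example, a tool could drop a file like the following into the start directory (a minimal sketch; the file name and the exported variable are illustrative). Since the file is left readable but not executable, the jobwrapper sources it and the variable becomes part of the job's environment:

# /opt/lcg/etc/jobwrapper-start.d/10-site-env.sh (illustrative name)
# Readable but NOT executable, so the jobwrapper sources it;
# the exported variable is then visible to the user's job.
export MY_SITE_SCRATCH=/scratch/tmp   # hypothetical variable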
The query webservice
The query web service runs at:
publisher_wsdl=http://<hostname>:8080/same-ws/services/WebArchiver?wsdl
query_wsdl=http://<hostname>:8080/same-ws/services/Database?wsdl
and its job is to mediate the communication between the clients and the database.
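A quick way to check that the service is up is to fetch one of the WSDLs, for example with curl (substitute your server's host name):

curl -s "http://<hostname>:8080/same-ws/services/Database?wsdl" | head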
The web portal
The web portal is a Python-based web application for displaying and querying the test results. Only users with valid certificates can view the pages; in addition, access is granted based on IP address. Several configuration and customization options are available, and instead of cookies, the portal remembers a user's settings by their certificate subject DN.
NOTE: The portal displays the results with their timestamps in GMT (winter time)! Forgetting this can cause confusion.
The client part
The client binaries
The client binaries can be found under /opt/lcg/same/client/bin. The most important ones are the following:
- same-exec - executes sensors
- same-publish-tuples - publishes test results
- same-query - queries the database via the web service
- same-run-test - executes individual tests
The client configuration
The configuration file of the client part of SAM is /opt/lcg/same/client/etc/same.conf. Here you can define the default sensor filters, the SAM exit codes, the server endpoint URLs, log levels, timeouts, VOs, etc. The format of the file is straightforward and is not detailed here.
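Since the endpoint keys shown in the query web service section above live in this file, a quick way to see which servers a client talks to is, for example:

grep wsdl /opt/lcg/same/client/etc/same.conf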
The sensors and tests
Structure of the sensors directory
The client sensors are located in the /opt/lcg/same/client/sensors directory. For each sensor (service) there has to be a directory with the same name. The structure of this directory is the following:
/opt/lcg/same/client/sensors/
CE/
gRB/
SE/
check-SE
prepare-SE
test-sequence.lst
tests/
SE-lcg-cp
SE-lcg-cp.def
FTS/
gCE/
...
A sensor (i.e. a directory under /opt/lcg/same/client/sensors/) tests a service. A sensor can have several tests. A test is an executable file of any kind (usually a bash, perl or python script) which has:
- HTML output
- SAM-defined exit values
- a set of input parameters given to it by the SAM framework
There are no other restrictions.
Each test name has to have the form <SENSOR_NAME>-<TEST_NAME>, and each test has to have an associated .def file which defines the test. In its minimal form it has to contain the following information:
testName: SE-lcg-cp
testAbbr: cp
testTitle: Copy a file back from the SE
EOT
This information is published automatically when a test is submitted, so there is no need to touch the database when adding a new test. Only new sensors have to be added by hand.
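A minimal test script could therefore look like the following sketch. The test name SE-hello is illustrative, and the sketch assumes that the symbolic exit codes listed in same.conf (e.g. SAME_OK) are exported into the test's environment; if not, the numeric values from the 'Exit values of tests' section below apply:

#!/bin/bash
# /opt/lcg/same/client/sensors/SE/tests/SE-hello (illustrative test)
# A test produces HTML output and returns a SAM-defined exit value.
echo "<h2>SE-hello</h2>"
echo "<p>Running for VO $SAME_VO on $(hostname)</p>"
exit ${SAME_OK:-10}   # 10 = ok, per the same.conf defaults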
Executing a test
What happens when launching a sensor?
- The SAM client connects to the SAM query web service and retrieves the list of nodes that pass the filter belonging to the sensor. For each sensor, a default filter can be defined in /opt/lcg/same/client/etc/same.conf or given on the command line.
- It publishes the test definitions.
- It goes to the sensor's directory and executes the prepare-<sensor> script.
- For each node in the list, it executes the check-<sensor> script with 3 parameters (sitename, nodename, inMaintenance) and puts it into the background. The check-<sensor> script:
  - creates the working directories
  - executes the tests in the order defined in test-sequence.lst (see the sketch below), passing the environment and input parameters
  - publishes the results to the database via the web service
- For each node, a directory named after the node is created under $SAME_SENSOR_WORK/nodes, and all the results of the tests for that node go there.
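For the SE sensor shown earlier, test-sequence.lst could be as simple as the following sketch, assuming it lists one test name per line in execution order:

SE-lcg-cp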
A test can run on the node from which it was submitted, or it can submit a job. The same-exec command deals with the tests and checks their status:
- same-exec --submit SE - submits a sensor (the default operation)
- same-exec --status SE - queries the status of the tests
- same-exec --cancel SE - cancels the tests
- same-exec --publish SE - publishes the results of the tests
These options have to be explicitly handled in the check-<sensor> scripts and in the tests.
Exit values of tests
The test exit codes are defined in the same.conf configuration file and have to be one of the following:
ok =10 (SAME_OK)
info =20 (SAME_INFO)
notice =30 (SAME_NOTICE)
warning =40 (SAME_WARNING)
error =50 (SAME_ERROR)
critical =60 (SAME_CRITICAL)
maintenance =100 (SAME_MAINTENANCE)
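As an illustration, a check script could map a test's numeric exit status back to its symbolic name like this (a minimal bash sketch; the numeric values are the same.conf defaults above, and the test path is illustrative):

#!/bin/bash
$SAME_SENSOR_HOME/tests/SE-lcg-cp
case $? in
  10)  echo "ok" ;;
  20)  echo "info" ;;
  30)  echo "notice" ;;
  40)  echo "warning" ;;
  50)  echo "error" ;;
  60)  echo "critical" ;;
  100) echo "maintenance" ;;
  *)   echo "unknown exit code" ;;
esac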
Result of the test
The overall status of a service is a function of its critical tests, i.e. it is equal to the status of the worst critical test. The criticality of a test can depend on the VO, so the same service can have different statuses for different VOs. The critical tests are defined in the TestCriticality table.
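The computation can be pictured with a query along these lines. This is a sketch only, not a query taken from SAM: the TestCriticality column names are guesses in the style of the schema above, and it assumes status ids are ordered by severity:

sqlplus -s same_user/same_pass <<EOF
-- worst status among the critical tests of one node and VO
select max(td.statusid)
  from testdata td, testcriticality tc
 where td.testid = tc.testid
   and td.void   = tc.void
   and td.nodeid = 42   -- example node id
   and td.void   = 1;   -- example VO id
EOF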
SAM environment variables
The SAM framework sets some environment variables and passes them to the tests:
- SAME_SENSOR_HOME - points to the directory of the sensor the test belongs to, e.g. /opt/lcg/same/client/sensors/SE
- SAME_SENSOR_WORK - points to a place the tests can use to store temporary files and their results
- SAME_VO - defines the VO under which the test is actually run
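A test script might use these variables like this (a minimal sketch):

#!/bin/bash
# Example use of the SAM-provided variables inside a test.
CONFIG="$SAME_SENSOR_HOME/tests/SE-lcg-cp.def"   # a file shipped with the sensor
SCRATCH="$SAME_SENSOR_WORK/tmp.$$"               # per-run scratch space
mkdir -p "$SCRATCH"
echo "<p>Testing as VO: $SAME_VO</p>"            # tests produce HTML output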
Submission examples
- Simple submission of the SE test:
/opt/lcg/same/client/bin/same-exec SE
- Directed submission of the SE test to a given node:
/opt/lcg/same/client/bin/same-exec SE nodename=mySE.cern.ch
- Querying the status of the CE test:
/opt/lcg/same/client/bin/same-exec --status CE
Further reading
Here you can find a collection of links, documentation and e-mail lists which could be useful when working with SAM.
Documentation
- Service Availability Monitor (SAM) Documentation - Installation Guide
- Adding new tests and publishing results with SAM
- SAM installation guide in Pre-Production
- Service Availability Monitoring in EGI InSPIRE
Feedback
- SAM Savannah - for bug submission
E-mail lists
- SAM announcements - same-announce@cern.ch
- SAM Development - same-devel@cern.ch
- SAM Support - same-grid-support@cern.ch
Download
- The SAM APT repo
- The APT string: "rpm http://grid-deployment.web.cern.ch/grid-deployment/gis/SAM slc3 SAM"
- The SAM CVS
- CVS command line settings for the official SAM tests
export CVSROOT=:pserver:anonymous@jra1mw.cvs.cern.ch:/cvs/jra1mw
export CVS_RSH=ssh
cvs co same
- CVS command line settings for certification SAM tests
export CVSROOT=:pserver:anonymous@jra1mw.cvs.cern.ch:/cvs/jra1mw
export CVS_RSH=ssh
cvs co org.glite.testsuites.ctb
make rpm
SAM instances
- Production SAM portal
- SAM portal for Pre-Production - Certified sites
- SAM portal for Pre-Production - Uncertified sites
- SAM portal for Certification
Troubleshooting
- The LCG directory
- The LCG Troubleshooting Guide
Standalone SAM installation
- How to install standalone SAM
Gergely Debreczeni