Certification Testbed at GRNET/University of Athens
GRNET and the University of Athens operate a certification testbed. The site name is EGEE-SEE-CERT. The main purpose of the testbed is the certification of the Torque batch system. In particular, certification at GRNET focuses on a double-CE configuration.
Coordinators
Ioannis Liabotis, Email: iliaboti@grnet.gr
Nikos Voutsinas, Email: nvoutsin@noc.edunet.gr
General Site Information
Nodes
| Host | Role |
| ctb01.gridctb.uoa.gr | BDII_site |
| ctb02.gridctb.uoa.gr | MON |
| ctb03.gridctb.uoa.gr | MySQL server |
| ctb04.gridctb.uoa.gr | cream CE |
| ctb05.gridctb.uoa.gr | worker node |
| ctb06.gridctb.uoa.gr | DPM head node |
| ctb07.gridctb.uoa.gr | TORQUE_server |
| ctb08.gridctb.uoa.gr | DPM disk node |
| ctb09.gridctb.uoa.gr | DPM disk node |
| ctb10.gridctb.uoa.gr | worker node |
| ctb11.gridctb.uoa.gr | worker node |
| ctb12.gridctb.uoa.gr | worker node |
| ctb13.gridctb.uoa.gr | worker node |
Software
All software is based on SL4 and gLite 3.1.
Hardware
| DELL SC1425 1U rack-mounted servers |
| CPU: Intel Xeon 3 GHz |
| RAM: 2 GB (except the TORQUE server and worker nodes, which have 512 MB) |
| LAN: dual Gigabit Ethernet |
| HDD: 80 GB |
Support nodes on commodity PCs
| Certification level UI |
| Production level UI |
| DHCP/PXEBOOT/SOL (Serial over LAN)/NFS server |
Distinguishing features
Modular configuration
In order to support atypical configurations, and to stress YAIM's configuration capabilities to their design limits, EGEE-SEE-CERT currently uses a modular configuration, whereby each service's configuration resides in a separate file. VO-specific configuration directives also reside in separate files.
This configuration has been tested and found to work correctly for our setup here at EGEE-SEE-CERT. It should also be noted that it has only been tested with yaim-4.0.6. For these reasons the configuration, as it appears below, should not be adopted lightly, especially if a different setup is desired or a different YAIM version is used.
Our whole configuration is available here.
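As a rough illustration of what we mean by "modular" (the layout below is a hypothetical example, not our actual tree), YAIM 4 can pick up per-service fragments from a services/ directory and per-VO fragments from a vo.d/ directory that sit next to site-info.def, and is then invoked per node type:

# Hypothetical layout; our real configuration is linked above.
#   siteinfo/site-info.def                  - common variables
#   siteinfo/services/glite-creamce         - CE-specific variables
#   siteinfo/services/glite-se_dpm_mysql    - DPM head node variables
#   siteinfo/vo.d/dteam                     - per-VO settings
#
# Configure a node by pointing YAIM at the main file; the services/ and
# vo.d/ fragments next to it are picked up automatically.
/opt/glite/yaim/bin/yaim -c -s /path/to/siteinfo/site-info.def -n creamCE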
Central MySQL server
Our site uses a single MySQL server, running on a dedicated node. The other nodes that need a MySQL database (MON, DPM, etc.) are configured to use this central server remotely.
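For illustration, giving a service node access to the central server comes down to MySQL GRANT statements of the following shape (database, user and password below are placeholders, not our actual values):

# Run on the central MySQL node (ctb03); names and password are placeholders.
mysql -u root -p <<'EOF'
CREATE DATABASE IF NOT EXISTS dpm_db;
GRANT ALL PRIVILEGES ON dpm_db.* TO 'dpmmgr'@'ctb06.gridctb.uoa.gr' IDENTIFIED BY 'secret';
FLUSH PRIVILEGES;
EOF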
Separate TORQUE server
Our site installs the TORQUE batch server separately from the CE node.
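In site-info terms this mainly means pointing the batch-system variables at the dedicated host. A minimal sketch, assuming the usual gLite 3.1 YAIM variable names (verify them against the YAIM version in use):

# Fragment of site-info.def (sketch; variable names should be verified).
CE_HOST=ctb04.gridctb.uoa.gr
BATCH_SERVER=ctb07.gridctb.uoa.gr   # TORQUE server on its own node
JOB_MANAGER=pbs
CE_BATCH_SYS=pbs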
Separate DPM head and disk nodes
Our site installs a dedicated DPM head node and uses other nodes as dedicated disk nodes.
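The DPM split is likewise expressed through the DPM-related YAIM variables. Another sketch, again assuming the gLite 3.1 variable names (verify against your YAIM version):

# Fragment of site-info.def (sketch; variable names should be verified).
DPM_HOST=ctb06.gridctb.uoa.gr          # head node
DPM_DB_HOST=ctb03.gridctb.uoa.gr       # central MySQL server
DPMPOOL=testbed_pool                   # pool name is a placeholder
DPM_FILESYSTEMS="ctb08.gridctb.uoa.gr:/data ctb09.gridctb.uoa.gr:/data"   # /data is a placeholder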
Short Deadline Jobs
As explained on the page ShortJobQueueAtGRNET, we recently deployed ShortDeadlineJobs at our site (currently not operational -- KostantinosKoukopoulos - 28 Jan 2009).
Installation and Configuration
The following information is not provided as a guide or a suggestion to other site administrators, but rather for informative purposes. The setup procedure described here is in some ways very site-specific. That said, we hope that others might find it useful in solving their own problems.
Installation and Configuration Outline
The deployment procedure occurs in three general phases:
- Base OS installation
- Common grid-related tasks
- Node-specific installation and configuration
The first phase uses kickstart to install Scientific Linux 4.7 simultaneously on all our nodes (our kickstart configuration is available here).
The last two phases are performed by our custom installation script (available here), which detects the correct node type either from a command-line argument or by consulting a host-node map file (example). The script, along with the other files that it uses, resides in a shared NFS area (which is also configurable).
This script is often modified when we re-deploy the testbed, usually to adjust the easily configured variables at its beginning (for example PATCHES, which lists the patches from Savannah to include in the YUM repositories). See the comments in the script for more examples.
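A simplified sketch of the node-type detection (the map file format, path and variable names here are illustrative; the real script linked above differs in detail):

#!/bin/bash
# Illustrative only: take the node type from the first argument, or fall
# back to a "hostname nodetype" map file on the shared NFS area.
NFS_AREA=${NFS_AREA:-/shared/install}     # hypothetical path
MAP=$NFS_AREA/host-node.map               # e.g. "ctb04.gridctb.uoa.gr creamCE"

NODE_TYPE=$1
if [ -z "$NODE_TYPE" ]; then
    NODE_TYPE=$(awk -v h="$(hostname -f)" '$1 == h { print $2 }' "$MAP")
fi
[ -n "$NODE_TYPE" ] || { echo "unknown node type for $(hostname -f)" >&2; exit 1; }
echo "Configuring this host as: $NODE_TYPE"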
Common node configuration
The installation script, mentioned above, first removes some unnecessary packages and disables yum autoupdate. Next it configures the YUM repositories: specifically, it enables the DAG repository and installs the lcg-CA and certification repo files, as well as the repo files for any patches included in the installation.
Then, with the exception of the central MySQL node, the root (CA) certificate packages are installed and host certificates are copied into place from the shared location.
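In shell terms the common phase amounts to steps of the following kind (a sketch, not the actual script; package lists, service names and paths are abbreviated or hypothetical):

# Sketch of the common phase (abbreviated; the real script is linked above).
# yum -y remove ...                     # site-specific list of unneeded packages
/sbin/chkconfig yum-autoupdate off      # auto-update service name varies by SL release; verify

# Repositories: enable DAG, install the lcg-CA and certification repo files,
# plus repo files for any patches included in the installation.
cp $NFS_AREA/repos/*.repo /etc/yum.repos.d/   # hypothetical shared location
yum -y install lcg-CA

# Host certificates from the shared area (this and lcg-CA are skipped on the MySQL node).
install -m 644 $NFS_AREA/certs/$(hostname -f)/hostcert.pem /etc/grid-security/
install -m 400 $NFS_AREA/certs/$(hostname -f)/hostkey.pem  /etc/grid-security/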
Node-specific configuration
The installation script first configures YUM with the node-specific repositories and installs the necessary packages and metapackages. In particular, Java is installed for the MON, creamCE, TORQUE and WN nodes (see "Installation Issues" below for details on Java). Various configuration tasks not undertaken by YAIM are then performed, and the necessary fixes are applied via a series of local diff(1) patches (available here; see "Installation Issues" below for more details). Last, YAIM is called to configure the node.
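The node-specific phase then looks roughly like this, shown for a MON box (a sketch; the metapackage name, patch directory and node-type argument are examples to be checked against the node in question):

# Sketch of the node-specific phase for a MON box.
yum -y install glite-MON                      # metapackage for this node type

# Apply our local fixes before running YAIM (the patches are linked below).
for p in $NFS_AREA/patches/MON/*.diff; do     # hypothetical patch layout
    patch -p0 -d / < "$p"
done

/opt/glite/yaim/bin/yaim -c -s $NFS_AREA/siteinfo/site-info.def -n MON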
Patches, scripts and other resources
Patches, bug fixes, scripts and other resources related to our installation are available here:
http://ui.gridctb.uoa.gr/
Installation Issues
Quite a number of issues pop up during the deployment of a site such as ours that are either not covered by the documentation or have been reported but not yet fixed:
Java Issues
In order to install Sun's JDK we have adopted a procedure that deviates a bit from what is currently prescribed in GLite31JPackage. First we install xml-commons-jaxp-1.{2,3}-apis to avoid the known "Missing Dependency: jdk" error. Then we install the JDK from Sun's RPMs, followed by java-1.5.0-sun-compat. After this, GLite31JPackage says to disable JPackage 1.7 in order to avoid the conflicts between sun-jaf and geronimo-jaf when installing some gLite metapackages, but we have found this to be unnecessary (and perhaps problematic, since at the time of this writing the JPackage FAQ specifically states that JPackage 5.0 and 1.7 are to be used together). Instead we first install geronimo-jaf-1.1-api, which provides the jaf_1_1_api dependency because of which sun-jaf would otherwise be pulled in. See the installation script's install_java function for the details.
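A condensed sketch of those steps (the JDK RPM file name is a placeholder; the authoritative version is the install_java function in the script):

# Sketch of our Java procedure (see install_java in the installation script).
yum -y install xml-commons-jaxp-1.{2,3}-apis   # avoids the "Missing Dependency: jdk" error
rpm -Uvh jdk-1_5_0_XX-linux-i586.rpm           # Sun JDK RPM; file name is a placeholder
yum -y install java-1.5.0-sun-compat
yum -y install geronimo-jaf-1.1-api            # provides jaf_1_1_api before the gLite metapackages ask for it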
MySQL Issues
Due to having a separate and dedicated MySQL node, many modifications to the node YAIM configurations are necessary. In every case the YAIM functions assume that MySQL runs either on localhost or on $MON_HOST, with the consequence that the proper ACLs are not granted and the databases are not created. The following patches are necessary in order for YAIM to produce a working environment when MySQL is remote for all the nodes:
TORQUE Issues
Due to having a separate and dedicated TORQUE_server node, many modifications to the node YAIM configuration, and also to the installation procedure, are necessary:
* The TORQUE_server node also needs database permissions, which at the moment are only given to the CE (we patch config_apel_rgma for this). We also remove some unnecessary YAIM functions with this patch of glite-torque-utils.
DPM Issues
Apart from the necessary site-info configuration for the glite-SE_dpm_disk and glite-SE_dpm_mysql nodes, two patches are necessary for the correct operation of a DPM configuration with separate head and disk nodes:
Tomcat Issues
The package redhat-lsb has a bug in SL 4.7, inherited from RHEL, whereby /lib/lsb/init-functions uses 'alias' to define the init functions. This makes the Tomcat init script produce log_failure_msg: command not found. This has been solved in RHEL 5, as per this errata page at the RHN. We solve this issue by applying this quick'n'dirty fix to the MON and creamCE nodes.
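We have not reproduced the fix itself here, but the general shape of such a workaround is to give the init scripts real shell functions instead of aliases (aliases are not expanded in non-interactive scripts). Purely as an illustration, and not the patch we actually apply:

# Illustrative workaround only -- not our actual patch.
# Define the log_* helpers as real functions so that init scripts which do
# not expand aliases still find them.
cat >> /lib/lsb/init-functions <<'EOF'
log_success_msg() { echo "$@"; }
log_failure_msg() { echo "$@"; }
log_warning_msg() { echo "$@"; }
EOF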
Other Notes
CreamCE job state transitions
The possible states of a CREAM job are: REGISTERED, PENDING, IDLE, RUNNING, REALLY-RUNNING, HELD, CANCELLED, DONE-OK, DONE-FAILED, ABORTED, UNKNOWN. The possible state transitions with regard to suspending, cancelling and purging jobs are given below.
suspending a job (glite-ce-job-suspend)
| allowed on | results to |
| running | held |
| really-running | |
| idle | |
Note: PBS (i.e. Torque) does not allow suspending running jobs
cancelling a job (glite-ce-job-cancel)
| allowed on | results to |
| pending | cancelled |
| idle | |
| running | |
| really-running | |
| held | |
purging a job (glite-ce-job-purge)
| allowed on | results to |
| registered | status not available |
| done-ok | |
| done-failed | |
| aborted | |
| cancelled | |
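For reference, the operations in the tables above map onto the CREAM CLI roughly as follows (JOBID stands for the job identifier returned at submission time):

glite-ce-job-suspend $JOBID     # allowed on running/really-running/idle; results in held
glite-ce-job-resume  $JOBID     # resumes a held job
glite-ce-job-cancel  $JOBID     # allowed on pending/idle/running/really-running/held; results in cancelled
glite-ce-job-purge   $JOBID     # terminal states only; the status is no longer available afterwards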
CreamCE Command Status Type Description & Job Status Type Description
mysql> use creamdb;
Database changed
mysql>
mysql> select * from command_status_type_description;
+------+-------------+
| type | name |
+------+-------------+
| 0 | CREATED |
| 1 | QUEUED |
| 2 | SCHEDULED |
| 3 | RESCHEDULED |
| 4 | PROCESSING |
| 5 | REMOVED |
| 6 | SUCCESSFULL |
| 7 | ERROR |
+------+-------------+
8 rows in set (0.00 sec)
mysql> select * from job_status_type_description;
+------+----------------+
| type | name |
+------+----------------+
| 0 | REGISTERED |
| 1 | PENDING |
| 2 | IDLE |
| 3 | RUNNING |
| 4 | REALLY_RUNNING |
| 5 | CANCELLED |
| 6 | HELD |
| 7 | DONE_OK |
| 8 | DONE_FAILED |
| 9 | PURGED |
| 10 | ABORTED |
+------+----------------+
11 rows in set (0.00 sec)
CreamCE performance notes
Lately we have been involved in testing the performance of a CREAM CE. Here are some of our notes:
monitoring 5k jobs
One of the benchmarks that has been set is that a CREAM CE should be able to handle 5000 queued or running jobs. We have made some preliminary tests by disabling the queue of our LRMS, submitting 5000 jobs and then enabling the queue, in order to see how the CREAM CE copes.
An issue has arisen regarding the performance of the monitoring facilities. We have found it problematic to monitor 5000 jobs in a CREAM queue. Specifically, using glite-ce-job-status while CREAM is handling 5000 short-lived jobs in an enabled LRMS queue (say with 8 slots) does not work most of the time. The error message is:
[kouk@ui ~]$ glite-ce-job-status -e ctb04.gridctb.uoa.gr:8443 -all
2009-04-23 10:50:56,215 WARN - No configuration file suitable for loading. Using built-in configuration
2009-04-23 10:51:26,502 ERROR - EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred.
The above example occurred when the following output from qstat was obtained on the PBS server:
[root@ctb07 ~]# qstat -Q
Queue Max Tot Ena Str Que Run Hld Wat Trn Ext T
---------------- --- --- --- --- --- --- --- --- --- --- -
dteam 0 4561 yes yes 4548 8 0 0 0 5 E
see 0 0 yes yes 0 0 0 0 0 0 E
ops 0 1 yes yes 0 1 0 0 0 0 E
We have also noticed that if CEMon notifications are used instead of glite-ce-job-status, the notifications arrive so slowly that it is probably impossible to use them to regulate the submission rate of performance tests. We believe that the best way to monitor the status of 5000 jobs is probably to query the MySQL server that CREAM uses directly.
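We have not settled on such a query yet, but the idea is something along the following lines, assuming that status records live in a job_status table whose type column maps onto job_status_type_description (the actual creamdb schema, user and host must be checked on the CE):

# Sketch only: count status records per state, straight from creamdb.
# Table and column names beyond the two *_description tables shown above,
# as well as the user name, are assumptions.
# (In practice one would also restrict this to each job's most recent record.)
mysql -h ctb03.gridctb.uoa.gr -u cream -p creamdb -e "
  SELECT d.name, COUNT(*) AS jobs
  FROM job_status s JOIN job_status_type_description d ON s.type = d.type
  GROUP BY d.name;"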