Certification Testbed at GRNET/University of Athens

GRNET and the University of Athens operate a certification testbed; the site name is EGEE-SEE-CERT. The main purpose of the testbed is certification of the Torque batch system. In particular, certification at GRNET focuses on a double-CE configuration.


Ioannis Liabotis

Email : iliaboti@grnet.gr

Nikos Voutsinas

Email : nvoutsin@noc.edunet.gr

General Site Information


ctb01.gridctb.uoa.gr BDII_site
ctb02.gridctb.uoa.gr MON
ctb03.gridctb.uoa.gr MySQL server
ctb04.gridctb.uoa.gr cream CE
ctb05.gridctb.uoa.gr worker node
ctb06.gridctb.uoa.gr DPM head node
ctb07.gridctb.uoa.gr TORQUE_server
ctb08.gridctb.uoa.gr DPM disk node
ctb09.gridctb.uoa.gr DPM disk node
ctb10.gridctb.uoa.gr worker node
ctb11.gridctb.uoa.gr worker node
ctb12.gridctb.uoa.gr worker node
ctb13.gridctb.uoa.gr worker node


All software is based on SL4 and gLite 3.1.


DELL SC1425 1U rack mounted servers
CPU Intel Xeon 3GHz
RAM 2GB (except the TORQUE server and worker nodes, which have 512MB)
LAN dual Gigabit Ethernet

Support nodes on commodity PCs

Certification level UI
Production level UI
DHCP/PXEBOOT/SOL (Serial Over Lan)/NFS server

Distinguishing features

Modular configuration

In order to support atypical configurations, and to stress YAIM's configuration capabilities to their design limits, EGEE-SEE-CERT currently uses a modular configuration, whereby each service's configuration resides in a separate file. VO-specific configuration directives also reside in separate files.

This configuration has been tested and found to work correctly for our setup here at EGEE-SEE-CERT. Note also that it has only been tested with yaim-4.0.6. For these reasons the configuration, as it appears below, should not be adopted lightly, especially if a different setup is desired or a different YAIM version is used.
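For illustration, a modular siteinfo tree of this kind might look roughly as follows (the file names here are placeholders, not our actual configuration; YAIM 4 reads per-service files from services/ and per-VO files from vo.d/ when they are present):

```
siteinfo/
    site-info.def        # variables common to all nodes
    services/            # one file per service (read by YAIM 4)
        glite-creamce
        glite-se_dpm_mysql
    vo.d/                # one file per supported VO
        dteam
        see
```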

Our whole configuration is available here.

Central MySQL server

Our site utilizes a single MySQL database, on a dedicated node. The other nodes that need access to a MySQL database (MON, DPM, etc.) are configured to use this central database remotely.
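To sketch what "configured to use this central database remotely" entails on the MySQL side, grants of the following form are needed on the central node (the host names match our testbed, but the database names, user names and password are illustrative placeholders):

```sql
-- Illustrative example: allow the MON and DPM nodes to reach their
-- databases on the central MySQL server (names/password are placeholders).
GRANT ALL PRIVILEGES ON accounting.* TO 'accounting'@'ctb02.gridctb.uoa.gr'
  IDENTIFIED BY 'secret';
GRANT ALL PRIVILEGES ON dpm_db.* TO 'dpmmgr'@'ctb06.gridctb.uoa.gr'
  IDENTIFIED BY 'secret';
FLUSH PRIVILEGES;
```

The stock YAIM functions do not issue such remote grants themselves, which is why the patches described under "MySQL Issues" below are needed.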

Separate TORQUE server

Our site installs the TORQUE batch server separately from the CE node.

Separate DPM head and disk nodes

Our site installs a dedicated DPM head node and uses other nodes as dedicated disk nodes.

Short Deadline Jobs

As explained on the page ShortJobQueueAtGRNET, we recently deployed ShortDeadlineJobs at our site (currently not operational -- KostantinosKoukopoulos - 28 Jan 2009).

Installation and Configuration

The following information is provided not as a guide or suggestion to other site administrators, but rather for informative purposes. The setup procedure described is in some ways very site-specific. That said, we hope that others might find it useful in solving their own problems.

Installation and Configuration Outline

The deployment procedure occurs in three general phases:
  1. Base OS installation
  2. Common grid-related tasks
  3. Node-specific installation and configuration
The first phase uses kickstart to simultaneously install Scientific Linux 4.7 on all our nodes (our kickstart configuration is available here).

The last two phases are performed by our custom installation script (available here), which detects the correct node type either from a command-line argument or by consulting a host-node map file (example). The script, along with the other files it uses, resides in a shared NFS area (which is also configurable).

This script is often modified when we re-deploy the testbed, usually by changing the easily configurable variables at the beginning (for example PATCHES, which lists the patches from Savannah to include in the yum repos). See the comments in the script for more examples.
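The node-type detection can be sketched as follows (a simplified version; the real script also accepts the type as a command-line argument and reads the map file from the shared NFS area). We assume a map file with one "<hostname> <node-type>" pair per line:

```shell
# Minimal sketch of the host-node map lookup.
# $1 = path to the host-node map file, $2 = hostname to look up.
node_type() {
    awk -v h="$2" '$1 == h { print $2; exit }' "$1"
}
```

Running the installer with no argument would then amount to something like `node_type "$MAPFILE" "$(hostname -f)"`.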

Common node configuration

The installation script, mentioned above, first removes some unnecessary packages and disables yum autoupdate. Next it configures the YUM repositories: it enables the DAG repository and installs the lcg-CA and certification repo files, as well as the repo files for any patches being included in the installation.

On all nodes except the central MySQL node, the root certificate packages are then installed and host certificates are installed from the shared location.

Node-specific configuration

The installation script first configures YUM with the node-specific repos and installs the necessary packages and meta-packages. In particular, Java is installed for the MON, creamCE, TORQUE and WN nodes (see "Installation Issues" below for details on Java). Various configuration tasks not handled by YAIM are then performed, and various necessary fixes are applied via a series of local diff(1) patches (available here; see "Installation Issues" below for more details). Last, YAIM is called to configure the node.

Patches, scripts and other resources

Patches, bug fixes, scripts and other resources related to our installation are available here: http://ui.gridctb.uoa.gr/

Installation Issues

There are quite a number of issues that pop up during the deployment of a site such as ours that are not covered by the documentation or have been reported but not fixed yet:

Java Issues

In order to install Sun's JDK we have adopted a procedure that deviates a bit from what is currently prescribed in GLite31JPackage. First we install xml-commons-jaxp-1.{2,3}-apis to avoid the known "Missing Dependency: jdk" error. Then we install the JDK from Sun's RPMs, followed by java-1.5.0-sun-compat. After this, GLite31JPackage says to disable JPackage 1.7 in order to avoid the conflicts between sun-jaf and geronimo-jaf when installing some gLite metapackages, but we have found this to be unnecessary (and perhaps problematic, since at the time of this writing the JPackage FAQ specifically states that JPackage 5.0 and 1.7 are to be used together). Instead we first install geronimo-jaf-1.1-api, which provides the jaf_1_1_api dependency that would otherwise pull in sun-jaf. See the installation script's function "install_java" for the details.
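Condensed, the sequence looks like the sketch below. The package names follow the text above; JDK_RPM is a placeholder for the actual path to Sun's JDK RPM, and setting RUN=echo turns the function into a dry run that only prints the commands:

```shell
# Sketch of the install_java procedure (RUN=echo for a dry run).
RUN="${RUN:-}"
JDK_RPM="${JDK_RPM:-jdk.rpm}"   # placeholder: path to Sun's JDK RPM

install_java() {
    # avoid the known "Missing Dependency: jdk" error
    $RUN yum install -y xml-commons-jaxp-1.2-apis xml-commons-jaxp-1.3-apis
    # install the JDK from Sun's RPM, then the compat package
    $RUN rpm -Uvh "$JDK_RPM"
    $RUN yum install -y java-1.5.0-sun-compat
    # provides jaf_1_1_api, so sun-jaf is not pulled in via JPackage
    $RUN yum install -y geronimo-jaf-1.1-api
}
```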

MySQL Issues

Because we have a separate and dedicated MySQL node, many modifications to the node YAIM configuration are necessary. In all cases the YAIM functions assume that MySQL runs either on localhost or on $MON_HOST, with the consequence that the proper ACLs are not granted and databases are not created. The following patches are necessary in order for YAIM to produce a working environment when MySQL is remote for all the nodes:


TORQUE Issues

Due to having a separate and dedicated TORQUE_server node, many modifications to the node YAIM configuration, and also to the installation procedure, are necessary:

  • on a combined creamCE/TORQUE_server installation the blah parser is installed and configured by the creamCE metapackage. But when the TORQUE_server is installed separately we need to install the blah parser manually, along with the YAIM creamCE module needed for its configuration. We therefore install the glite-ce-blahp and glite-yaim-cream-ce packages and execute the following on the TORQUE_server node:
    /opt/glite/yaim/bin/yaim -r -s $YAIM_CONF/siteinfo/site-info.def -n creamCE -f config_cream_blparser
    /opt/glite/yaim/bin/yaim -r -s $YAIM_CONF/siteinfo/site-info.def -n creamCE -f config_glite_initd
  • The TORQUE_server node also needs database permissions, which at the moment are only given to the CE (we patch config_apel_rgma for this). We also remove some unnecessary YAIM functions with this patch of glite-torque-utils.

DPM Issues

Apart from the necessary site-info configuration for the glite-SE_dpm_disk and glite-SE_dpm_mysql nodes, there are two patches necessary for correct operation of a separate head and disk node DPM configuration:

Tomcat Issues

The package redhat-lsb has a bug in SL 4.7, inherited from RHEL, where /lib/lsb/init-functions uses 'alias' to define the init functions. This makes the Tomcat init script produce "log_failure_msg: command not found". This has been solved in RHEL 5, as per this errata page at the RHN. We solve the issue by applying this quick-and-dirty fix to the MON and creamCE nodes.
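Our actual fix is the one linked above; a functionally similar workaround is to redefine the LSB logging helpers as real shell functions, since aliases are not expanded in non-interactive shells, which is exactly why the init script cannot find them:

```shell
# Workaround sketch: define the LSB log_* helpers as shell functions
# so init scripts running non-interactively can still call them.
log_success_msg() { echo "$@"; return 0; }
log_failure_msg() { echo "$@"; return 1; }
log_warning_msg() { echo "$@"; return 0; }
```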

Other Notes

CreamCE job states transitions

The possible states of a CREAM job are: REGISTERED, PENDING, IDLE, RUNNING, REALLY-RUNNING, HELD, CANCELLED, DONE-OK, DONE-FAILED, ABORTED, UNKNOWN. The possible state transitions with regard to suspending, cancelling and purging jobs are given below.

suspending a job (glite-ce-job-suspend)

allowed on    results to
RUNNING       HELD

Note: PBS (i.e. Torque) does not allow suspending running jobs

cancelling a job (glite-ce-job-cancel)

allowed on    results to
PENDING       CANCELLED

purging a job (glite-ce-job-purge)

allowed on    results to
REGISTERED    status not available

CreamCE Command Status Type Description & Job Status Type Description

mysql> use creamdb;
Database changed
mysql> select * from command_status_type_description;
| type | name        |
|    0 | CREATED     |
|    1 | QUEUED      |
|    2 | SCHEDULED   |
|    3 | RESCHEDULED |
|    4 | PROCESSING  |
|    5 | REMOVED     |
|    6 | SUCCESSFULL |
|    7 | ERROR       |
8 rows in set (0.00 sec)

mysql> select * from job_status_type_description;
| type | name           |
|    0 | REGISTERED     |
|    1 | PENDING        |
|    2 | IDLE           |
|    3 | RUNNING        |
|    4 | REALLY_RUNNING |
|    5 | CANCELLED      |
|    6 | HELD           |
|    7 | DONE_OK        |
|    8 | DONE_FAILED    |
|    9 | PURGED         |
|   10 | ABORTED        |
11 rows in set (0.00 sec)

CreamCE performance notes

Lately we have been involved in testing the performance of a CREAM CE. Here are some of our notes:

monitoring 5k jobs

One of the benchmarks is that a CREAM CE should be able to handle 5000 queued or running jobs. We have made some preliminary tests by disabling the queue of our LRMS, submitting 5k jobs and then enabling the queue, in order to see how the CREAM CE copes.
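The procedure can be outlined as below. The command names are real, but the CE endpoint, queue name and JDL file are examples from our setup; setting RUN=echo makes this a dry run (in practice the qmgr commands run on the TORQUE server and the submissions from a UI):

```shell
# Outline of the 5k-job stress test (RUN=echo for a dry run).
RUN="${RUN:-}"
CE="ctb04.gridctb.uoa.gr:8443/cream-pbs-dteam"
JDL="short.jdl"
NJOBS="${NJOBS:-5000}"

fill_queue() {
    # 1. disable the LRMS queue so submitted jobs pile up
    $RUN qmgr -c "set queue dteam enabled = false"
    # 2. submit NJOBS short-lived jobs through CREAM
    i=0
    while [ "$i" -lt "$NJOBS" ]; do
        $RUN glite-ce-job-submit -a -r "$CE" "$JDL"
        i=$((i + 1))
    done
    # 3. re-enable the queue and watch how CREAM copes
    $RUN qmgr -c "set queue dteam enabled = true"
}
```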

An issue has arisen regarding the performance of the monitoring facilities: we have found that it is problematic to monitor 5k jobs in a CREAM queue. Specifically, glite-ce-job-status does not work most of the time when CREAM is handling 5000 short-lived jobs in an enabled LRMS queue (say with 8 slots). The error message is:

[kouk@ui ~]$  glite-ce-job-status -e ctb04.gridctb.uoa.gr:8443 -all
2009-04-23 10:50:56,215 WARN - No configuration file suitable for loading. Using built-in configuration
2009-04-23 10:51:26,502 ERROR - EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred.

The above example occurred when the following output from qstat was obtained on the PBS server:

[root@ctb07 ~]# qstat -Q
Queue              Max   Tot   Ena   Str   Que   Run   Hld   Wat   Trn   Ext T
----------------   ---   ---   ---   ---   ---   ---   ---   ---   ---   --- -
dteam                0  4561   yes   yes  4548     8     0     0     0     5 E
see                  0     0   yes   yes     0     0     0     0     0     0 E
ops                  0     1   yes   yes     0     1     0     0     0     0 E

We have also noticed that if CEMon notifications are used instead of glite-ce-job-status, the notifications arrive so slowly that it is probably impossible to use them for regulating the submission rate of performance tests. We believe that the best way to monitor the status of 5k jobs is probably to query the MySQL server that CREAM uses directly.
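A direct query might look like the following. This is a hypothetical sketch: it assumes a job_status table in creamdb keyed by job state type and joins against the job_status_type_description table shown above, but note that job_status records state transitions rather than one row per job, and the exact schema should be checked against the installed CREAM version:

```sql
-- Hypothetical: count recorded state transitions per state in creamdb.
USE creamdb;
SELECT d.name, COUNT(*) AS transitions
FROM job_status s
JOIN job_status_type_description d ON d.type = s.type
GROUP BY d.name;
```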

Topic revision: r12 - 2009-04-24 - KostantinosKoukopoulos