Results of tests run from UoA (EGEE-SEE-CERT)

Date: 04 Mar 2007 | Component: YAIM | Patch: 1077
At EGEE-SEE-CERT we applied patch #1077 (yaim-3.0.1-9) and reconfigured the site using the same configuration we had used with yaim-3.0.0-36, to simulate a more realistic yaim-upgrade scenario and uncover backward-compatibility issues.

Here are the findings:

1. SITE_SUPPORT_EMAIL is required and must be defined in site-info.def

2. A <queue>_GROUP_ENABLE variable should be defined for each queue the site supports (e.g. DTEAM_GROUP_ENABLE="dteam")

3. Yaim reports the wrong version (v.3.0.1-7 instead of -9) in the informational lines printed after yaim execution

4. The pipe to tee at the end of yaim (bin/yaim) may hang, waiting for open file descriptors to close.

5. yaim configuration of glitece and BDII_site should be done in separate commands (and maybe in a specific order? Haven't we seen this one in previous versions of yaim?)

6. The JobID was missing from the lrmsID reported in the blah accounting logs.

7. Site reconfiguration ended with a broken gliteCE: no jobs could be submitted through our CE. All other nodes, such as WMS, MON, BDII, siteBDII (partially), WN (SL3), and Torque/Maui, continued to work with no (obvious) problems.
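To illustrate findings 1 and 2, a site-info.def fragment covering both variables might look like the following; the e-mail address and the ops queue are hypothetical placeholders, not taken from our configuration:

```shell
# Hypothetical site-info.def fragment (YAIM variables are plain shell
# assignments). The address and the second queue are placeholder examples.
SITE_SUPPORT_EMAIL="grid-support@example.org"

# One <QUEUE>_GROUP_ENABLE per queue the site supports:
DTEAM_GROUP_ENABLE="dteam"
OPS_GROUP_ENABLE="ops"
```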
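Finding 4 is a classic file-descriptor inheritance issue; a minimal sketch of our own (not yaim's actual code) shows both the hang mechanism and how it is avoided:

```shell
# A background child that inherits the pipe's write end keeps tee alive
# until the child exits; redirecting the child's stdout/stderr releases
# the pipe immediately. (Illustrative sketch, not yaim code.)
{ echo "configuring..."; sleep 5 >/dev/null 2>&1 & } | tee yaim.log
# Because the backgrounded sleep does not hold the pipe open, tee sees
# EOF right away instead of waiting for the child to finish.
```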

Date: 30 Jan 2007 | Component: TORQUE | Patch: 950 | Methodology: Report
Stress-test results on torque/maui showed problems when the number of jobs in the queues exceeds roughly 15,000: Maui sometimes crashes or becomes completely unresponsive (memory-management and/or file-descriptor issues?). Although we spent a lot of time trying to find a systematic way to trigger this behavior, we couldn't. At the same time we discovered that some of these problems have already been reported and should be well known to Steve Traylen.

No other serious functionality problems were noticed with #950, besides some switches that don't work (e.g. qdel -W) and the multiple 7's phenomenon of lcg-info-dynamic-scheduler in EstimatedResponseTime. Based on the previous report, we can safely proceed to certification of the patch in EGEE-SEE-CERT's setup.

Date: 05 Jan 2007 | Component: TORQUE | Patch: 950
The latest upgrade of Torque and Maui went smoothly at EGEE-SEE-CERT. Our site configuration consists of one gliteCE node and a separate torque server. The upgrades were conducted on a fresh installation of gLite middleware with yaim-3.0.0-34, using the default torque and maui configurations.

To perform the upgrade, first save your configuration by running

qmgr -c 'print server'

on the torque server. If you use the default torque and maui settings, as in our case, this is not necessary. Now make sure you drain and close your queues:

qdisable queue1 queue2 ...

The final step is the actual upgrade on the related nodes: gliteCE, WN, TORQUE_server. Just add the Patch950.uncertified repository to /etc/apt/sources.list.d/lcg.list and run:

apt-get update
apt-get dist-upgrade

and reenable all queues:

qenable queue1 queue2 ...

You can then check how the new torque server is performing by running various commands on the CE or TORQUE_server host, such as the following (copied from "torque_quickstart_guide"):

# shutdown server

qterm -t quick

# start server

pbs_server

# verify all queues are properly configured

qstat -q

# view additional server configuration

qmgr -c 'p s'

# verify all nodes are correctly reporting

pbsnodes -a

# submit a basic job

su - dteam001 -c 'echo "sleep 30" | qsub'

# verify jobs display

qstat

Or you can submit jobs as usual from the UI. If something doesn't work, check that /var/spool/maui.cfg on the gliteCE contains the line "SERVERHOST your.torque.server".
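A quick way to verify this is a one-line grep; the sketch below runs against a sample file we create ourselves, while on the gliteCE you would point grep at /var/spool/maui.cfg directly:

```shell
# Sanity check (sketch): confirm maui.cfg names the torque server.
# maui_cfg_sample stands in for /var/spool/maui.cfg on a real gliteCE.
printf 'SERVERHOST your.torque.server\nADMIN1 root\n' > maui_cfg_sample
grep '^SERVERHOST' maui_cfg_sample
```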

Finally, please note that the "$clienthost your.torque.server" parameter in /var/spool/pbs/mom_priv/config on the WN_torque node is deprecated in torque version 2 and should be replaced by "$pbsserver your.torque.server".
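That replacement can be scripted with sed; the sketch below operates on a sample copy of the mom config (on a real WN the path is /var/spool/pbs/mom_priv/config, and the second directive is a made-up filler line):

```shell
# Rewrite the deprecated $clienthost directive to $pbsserver (torque 2.x).
# mom_config_sample stands in for /var/spool/pbs/mom_priv/config.
printf '$clienthost your.torque.server\n$logevent 255\n' > mom_config_sample
sed -i 's/^\$clienthost/$pbsserver/' mom_config_sample
cat mom_config_sample
```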

Date: 14 Dec 2006 | Component: TORQUE | Patch: 917

Our site setup can be seen at the following URL, under the Site Information for EGEE-SEE-CERT:

At first we attempted to upgrade Torque and Maui using the packages found in Steve Traylen's personal RPM repository:

The upgrade to the new Torque and Maui versions was done by manually upgrading the necessary RPMs in each host (gliteCE, separate TORQUE_server, WN_torque).

For safety, queues were drained and disabled, and the torque server configuration was saved to a file by running "qmgr -c 'print server'".

So, the ONLY new packages installed (upgraded) in each host were:

This upgrade process was seamless (except for the removal of some deprecated packages like torque-clients and torque-resmom) and the new torque server worked without any problem. In fact it was noticeably more reliable and faster than the previous torque, and the success rate of all job batches was 100%.

The second (and many subsequent...) attempt to certify patch 917 was done on a middleware installation based on the certification repositories, again on a configuration consisting of a separate TORQUE_server and gliteCE. After the cluster was up and running (properly, as tests indicated), the apt sources.list files were changed to include patch 917, and the upgrade was completed with an "apt-get dist-upgrade".

Much to our surprise, the "apt-get dist-upgrade" brought not only the new torque and maui packages but 80 to 150 new packages (depending on the node) which previously weren't needed! However, no package conflicts occurred and every single package was installed automatically.

Unfortunately the test job batches never executed properly. Every daemon seemed to run fine, yet all jobs were being aborted. Sometimes all jobs were held in Scheduled state eternally; once, all jobs were stuck in Running state for far too long; and on yet another installation all jobs simply ran.

Investigating the problems didn't help. At first we realised it was actually a problem with the new condor (yes, a completely different level, but we use our own WMS, which got upgraded together with the rest of the cluster). At other times it was a problem with the BLParser, which died unexpectedly, and at others with Torque itself. Even when problems got fixed and some test jobs executed successfully, a reboot of all nodes turned the testbed upside-down: nothing worked again and new problems kept surfacing.

To sum up, we cannot certify this patch yet. Condor never worked as it should (problems were pinpointed to the WMS-CE communication but weren't analyzed further, as that wasn't our primary target), even though tests were also made using the SA3 WMS at CERN. The BLParser kept dying for no apparent reason. The few times that jobs got past the higher-level problems and reached Torque are not a representative sample of the situation, and no safe conclusions can be drawn given the current state of the cert version of the middleware.

New attempts will be made with the latest middleware updates...

-- Main.iliaboti - 29 Nov 2006

Topic revision: r12 - 2008-11-06 - LouisPoncet