Status of criteria to be met before transition can start



Note:
This table deals with the criteria for starting the transition from the LCG CE to the CREAM CE.
It does NOT deal with the initial deployment of the CREAM CE into the production service:
the CREAM CE is already available in the production middleware repositories, and all sites are strongly encouraged to deploy a CREAM CE in parallel with their LCG CE.

The status of features that were not considered showstoppers for the transition, but which are nevertheless considered very important for the CREAM CE, is tracked on a separate page.




CRITERIA COMMENTS STATUS
A The CREAM CE should provide functionality and performance at least equivalent to the LCG CE (excluding the ability for users to fork processes directly on the CE). The PPS Pilot of the CREAM CE has not shown any problems with missing functionality so far. The performance of the CREAM CE is still being tested.
NOTE: It is not possible for users to directly fork processes on the CREAM CE.
Status: yellow, yellow, yellow, gray
B Condor-G submission to CREAM must be available in production with no significant bugs. Functionality testing has been carried out by US-CMS. Full details are available here:
http://hepuser.ucsd.edu/twiki2/bin/view/HEPProjects/CMS-Cream

Results reported by US-CMS on 14th March suggest that there is a significant job failure rate (25%) and that the major source of errors is proxy renewal and delegation. This issue is currently under investigation by US-CMS and the developers.


*Update 18-Sep-09 by Massimo Sgaravatto (CREAM developer)*
At the latest CHEP, Sanjay Padhi from US-CMS reported that he had performed some tests with Condor-G submitting to CREAM, but that there were many failures due to proxy delegation and renewal. Since this was not a known issue, the CREAM team decided to arrange a debugging session using a dedicated CREAM CE. This was done, and in this test the proxy renewal/delegation problem was not seen at all. Instead, several jobs were canceled because the client (Condor-G) triggered the cancel commands. This was reported to Sanjay for investigation. I have not heard anything further from Sanjay (some of his CMS colleagues told me he is busy with other activities).

Todd Tannenbaum (Condor team), replying to a mail to Maarten on Sept 17, 2009, reported that they are going to continue these activities. We offered our help (in particular, we made a CREAM CE available for tests).

Last update: 19/09/09
Status: orange, orange, gray, gray
C The ICE-enabled WMS must be in production with no significant bugs.
• Waiting for fixes to ICE in PATCH:2459 (in particular, for BUG:44604, which causes a dramatic slowdown of the submission rate to the CREAM CE). This patch is now Ready For Integration (as of 5th March).
UPDATE: PATCH:2459 has been obsoleted by PATCH:2597 "gLite 3.2 WMS". This is in certification.

• PATCH:2862 is needed as it contains a fix for BUG:47911 "Performance issues when ICE has to manage many (thousands) active jobs which refer to many (thousands) different user proxy files on the WMS node".
Basically, this bug causes performance issues (see point below).


The Known Issues currently affecting ICE are listed on a separate page.
The main issue affecting production at the moment is: "There is a known incompatibility between the version of ICE deployed on the production WMS and the current version of CREAM which causes a crash in the ICE daemon. Therefore the submission to CREAM via WMS is not currently possible. Please use the direct submission instead."

The first version of the ICE-enabled WMS went to production with gLite 3.1 Update 53. A new version corresponding to PATCH:2862, currently in PPS, fixes BUG:47911 (but with status FIX not certified). Adequate testing of the chain is still an issue.
*Update 18-Sep-09 by Massimo Sgaravatto (CREAM developer) and Antonio Retico (Preproduction)*:
The latest available version of ICE is the one provided with patch #2862, in production since the 22nd of Sept (with gLite 3.1 Update 55).
Results of the pre-certification of this patch are available at:
https://twiki.cnaf.infn.it/cgi-bin/twiki/view/EgeeJra1It/WmsTestsP2862
Testing of the ICE+CREAM chain by CMS is in progress at CNAF, FZK, CERN and RAL.

Last update: 30/09/09
Status: green, green, yellow, yellow
D.i The ICE / CREAM job submission chain should be able to meet all performance criteria and otherwise perform at least as well as the WMS / LCG CE submission chain. The benchmark for the WMS / LCG CE submission rate is 15,000 jobs per day (an average of 1 job per 5.76 seconds), based on "best case" usage at CERN.

Long-running jobs submitted at 40 jobs/min show a significant failure rate (seen in the PPS pilot). This is covered by BUG:47911 (see item above). A new version corresponding to PATCH:2862, currently in PPS, fixes BUG:47911 (but with status FIX not certified). Adequate testing of the chain is still an issue.
*Update 22-Sep-09 by Massimo Sgaravatto (CREAM developer)*:
The latest available version of ICE is the one provided with patch #2862, in production since the 22nd of Sept (with gLite 3.1 Update 55).
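
As a quick check of the benchmark arithmetic: 15,000 jobs per day corresponds to 86,400 / 15,000 = 5.76 seconds per job, and the 40 jobs/min rate mentioned for the PPS pilot corresponds to 57,600 jobs per day. The sketch below only illustrates these unit conversions; the function names are illustrative and are not part of any gLite tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def seconds_per_job(jobs_per_day: float) -> float:
    """Average interval between submissions for a given daily job count."""
    return SECONDS_PER_DAY / jobs_per_day


def jobs_per_day_from_per_minute(jobs_per_minute: float) -> float:
    """Daily job count implied by a sustained per-minute submission rate."""
    return jobs_per_minute * 60 * 24


if __name__ == "__main__":
    # Benchmark quoted for the WMS / LCG CE chain.
    print(f"{seconds_per_job(15_000):.2f} s per job")
    # Submission rate quoted for the PPS pilot test.
    print(f"{jobs_per_day_from_per_minute(40):.0f} jobs per day")
```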
Results of the pre-certification of this patch are available at:
https://twiki.cnaf.infn.it/cgi-bin/twiki/view/EgeeJra1It/WmsTestsP2862

Last update: 18/09/09
Status: green, yellow, yellow, gray
D.ii 1 The ICE-WMS must deal gracefully with large peaks in the rate of jobs submitted to it. Testing is in progress by EGEE SA3.
This criterion was not covered by the tests done by SA3.
Test details are available at http://gridctb.uoa.gr/cream-performance-notes/report.html#criteria
However, there is an outstanding bug for this: BUG:48786.

*Update 18-Sep-09 by Massimo Sgaravatto (CREAM developer)*:
CREAM is able to protect itself if a certain policy based on the number of idle and/or running jobs in CREAM is matched (so, for example, CREAM can be configured not to accept any more jobs if it is already managing x jobs). What is missing is disabling job submissions also when the machine is overloaded (considering load, memory, etc.). This is tracked in bug #48786, whose fix is supposed to be released with patch #3179.
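
The self-protection policy described above can be sketched as a simple admission check. This is a minimal illustration under stated assumptions, not CREAM's actual code or configuration: the `Limits` class, its field names, and the `accept_submission` function are hypothetical, and the load-average check stands in for the machine-overload protection that the text says is still missing.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical policy thresholds (names are not CREAM parameters)."""
    max_idle_jobs: int
    max_running_jobs: int
    max_load_avg: float  # stands in for the still-missing overload check


def accept_submission(idle_jobs: int, running_jobs: int,
                      load_avg: float, limits: Limits) -> bool:
    """Return True if a new job may be accepted under the given limits."""
    if idle_jobs >= limits.max_idle_jobs:
        return False  # too many jobs already waiting
    if running_jobs >= limits.max_running_jobs:
        return False  # too many jobs already running
    if load_avg >= limits.max_load_avg:
        return False  # machine itself is overloaded
    return True


if __name__ == "__main__":
    limits = Limits(max_idle_jobs=1000, max_running_jobs=5000, max_load_avg=40.0)
    print(accept_submission(10, 4999, 2.5, limits))  # below every limit
    print(accept_submission(10, 5000, 2.5, limits))  # running-job limit reached
```

Rejecting new submissions early, rather than accepting them and failing later, is what lets the service stay responsive at its limit.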

Last update: 18/09/09
Status: orange, orange, gray, gray
E The CREAM CE must use an acceptable (to sites) proxy renewal mechanism. Patch 2669 introduced the LCG CE proxy renewal mechanism into the CREAM CE. This patch was released to production in gLite Update 41 (25th Feb).
Status: green, green, green, green
Done
F An adequate set of monitoring probes (SAM/Nagios) must be available for the CREAM CE. The set of SAM tests for the LCG CE also works against the CREAM CE. Some additional tests, specific to the CREAM CE, have also been written.
• The results of the CREAM CE tests being run in the PPS are here: PPS SAM results for CREAM CE.
• The results of the CREAM CE tests being run in production are here: SAM results for CREAM CEs in production.

An analysis is in progress to decide whether these tests are sufficient or whether more tests are needed; however, it is difficult to do this until we have a working ICE-WMS to CREAM CE chain in production.


Last update: 12/05/09
Status: yellow, yellow, yellow, gray
G There is a clear plan, with agreed implementation timelines, for migration of the CREAM CE away from gJAF. From the developers: Since patch #2669 (now in certification) we are no longer using gJAF (the one implemented by the security cluster): we copied the relevant code into the CE code, customized it as needed, and we are maintaining it (see Savannah task 7745).
The next step will be the integration with the new authz service, when available (see Savannah task 7746): we will have a meeting next Tuesday (17 Feb) with the NewAuthZService developers to discuss this integration with CREAM.
Status: green, green, green, green
Done
H.i The following batch systems must be integrated by default:
• LSF,
• PBS-Torque/Maui,
• SGE,
• Condor
• LSF: Demonstrated in the PPS pilot.
• Torque/Maui: Demonstrated in the PPS pilot.
• Condor: Implemented by the BLAH developers. Submission to Condor via CREAM/BLAH was successfully tested for functionality. Has this been stress tested?
• SGE: Work in progress at CESGA. Expected around mid-April.
• BQS: Work in progress at IN2P3 (Sylvain Reynaud)

Last update: 18/09/09
Status: yellow, yellow, yellow, gray
H.ii 2 The new BLAH Parser (i.e. BUpdater/BNotifier) should be integrated into the CREAM CE. From Sylvain Reynaud (IN2P3):
"The BLAH team sent me the RPM and all the needed information to make this work ... If stress tests are not needed, I think this requirement can be considered as satisfied."
*Update 18-Sep-09 by Massimo Sgaravatto (CREAM developer)*:
For Condor, the new BLAH Parser is already used. For LSF and Torque/PBS, patch #3259 introduces the new blparser. It is possible to choose the blparser type (the old one, which parses the log files, or the new one, which uses the batch system command/history commands) at configuration time.

Last update: 18/09/09
Status: yellow, yellow, yellow, gray
I The process for integrating other batch systems must be fully documented. The Blah Guide has been checked over by the BQS experts at IN2P3 and is considered to be "very useful".
Status: green, green, green, green
Done
J At least 5000 simultaneous jobs per CE node. Testing is in progress by EGEE SA3.
This criterion was verified during the tests done by SA3.
Test details are available at http://gridctb.uoa.gr/cream-performance-notes/report.html#criteria

Last update: 31/07/09
Status: green, green, green, green
Done
K Unlimited number of user/role/submission node combinations from many VOs (at least 50), up to the limit of the number of jobs supported on a CE node. Testing is in progress by EGEE SA3.
This criterion was not covered by the tests done by SA3.
Test details are available at http://gridctb.uoa.gr/cream-performance-notes/report.html#criteria

Last update: 31/07/09
Status: orange, orange, gray, gray
L Job failure rates in normal operations due to the CE < 0.1%. Testing is in progress by EGEE SA3.
This criterion was verified during the tests done by SA3.
Test details are available at http://gridctb.uoa.gr/cream-performance-notes/report.html#criteria
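
To make the 0.1% threshold concrete: it allows fewer than 1 failed job per 1,000 submitted. The sketch below shows one way such a check could be expressed. The function names and the job counts in the example are hypothetical and are not taken from the SA3 test report.

```python
def failure_rate(failed_jobs: int, total_jobs: int) -> float:
    """Fraction of jobs that failed."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    return failed_jobs / total_jobs


def meets_criterion(failed_jobs: int, total_jobs: int,
                    threshold: float = 0.001) -> bool:
    """True if the failure rate is strictly below the threshold (0.1% by default)."""
    return failure_rate(failed_jobs, total_jobs) < threshold


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(meets_criterion(9, 10_000))      # 0.09% failure rate
    print(meets_criterion(2_500, 10_000))  # 25% failure rate
```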

Last update: 31/07/09
Status: green, green, green, green
Done
M Job failures due to restart of CE services or reboot < 0.1%. Testing is in progress by EGEE SA3.
This criterion was partially verified during the tests done by SA3. During these tests, no job failures were noticed during reboot.
Test details are available at http://gridctb.uoa.gr/cream-performance-notes/report.html#criteria

Last update: 31/07/09
Status: yellow, yellow, yellow, yellow
N 1 month unattended running without significant performance degradation. During summer 2008, Alice were submitting jobs to a CREAM CE at FZK at a rate which gave a relatively continuous load of ~2000 concurrent jobs. Both Alice and the site administrator (Angela Poschlad) commented that the CE was remarkably stable during this time. Once Alice have been using the CREAM CEs in production for a little longer, we will decide if this criterion is fully met.
Status: yellow, yellow, yellow, gray
O Graceful failure or self-limiting behavior when the CE load reaches its maximum (e.g. if a CE node can support only 5000 jobs, it must not crash or become unresponsive with more than that). Testing is in progress by EGEE SA3.
This criterion was partially verified during the tests done by SA3. There is room for improvement.
Test details are available at http://gridctb.uoa.gr/cream-performance-notes/report.html#criteria
*Update 18-Sep-09 by Massimo Sgaravatto (CREAM developer)*:
CREAM is able to protect itself if a certain policy based on the number of idle and/or running jobs in CREAM is matched (so, for example, CREAM can be configured not to accept any more jobs if it is already managing x jobs). What is missing is disabling job submissions also when the machine is overloaded (considering load, memory, etc.). This is tracked in bug #48786, whose fix is supposed to be released with patch #3179.

Last update: 18/09/09
Status: yellow, yellow, yellow, yellow

-- NickThackray - 10 Feb 2009

Topic revision: r27 - 2009-09-30 - unknown
 