Summary of status as of 14/1/08

The new developments for the CREAM service are essentially finalized. The two major changes are:

  • The integration of a DB for the backend, to address the proxy renewal problem (task 4.10 of the CE checklist) and to address other scalability issues
  • A revision of the CREAM interface, to optimize some operations (in particular the lease operation, which proved not to be scalable). To speed up testing and debugging, the old interface has been preserved, so that jobs can still be submitted and cancelled from the "old" ICE client while the new one is being implemented.

Some tests (including stress tests lasting 6 days, but not proxy renewal tests) have been done using the old ICE client.

The new ICE client is being finalized (estimated time: 1-2 days): the interface changes in CREAM introduced some interoperability issues that needed to be addressed.

Then (this week) we will start functionality and stress tests, including proxy renewal.

Once we see that there are no major problems (rough estimate: 10 days, depending on which and how many problems have to be addressed), tests can be done by "independent" testers.

Then the CREAM and ICE software can be released for certification.

Other relevant tasks of the CREAM CE checklist:

  • The installation procedure of CREAM via yaim (task 2.1), already implemented for the old CREAM software, has been modified, since the new CREAM developments also require some changes in installation and configuration. An installation with this new yaim procedure has already been done and proved to work, although a few things remain to be fixed. Estimated time to finalize the yaim based installation procedure: 1 week. Please note that, since yaim-lsf-utils doesn't exist yet, for the time being the installation procedure can be tested only for PBS/Torque.
  • Concerning batch system support and the porting to new batch systems (task 4.6), the interaction with the batch system is fully managed by BLAH, which already supports Torque/PBS and LSF (submission to these batch systems via CREAM has been verified). As already reported, the BLAH BLParser has been reimplemented (basically relying on the batch system status/history commands instead of parsing the batch system log files), also to facilitate the porting to new batch systems. A first implementation of this new BLAH BLParser, with the relevant "plugin" supporting Condor, has been done, and the corresponding changes in the CREAM software have been implemented. The PIC SA3 people have been contacted, and they are going to make their Condor environment available to us to test and debug this BLAH and CREAM integration.
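
The idea behind the reimplemented BLParser (asking the batch system for job states through its status command, with one "plugin" per batch system) can be illustrated with a minimal sketch. This is not the BLAH code: the class and function names, the normalized state names and the two-column output format are all invented for illustration, and the real status commands of Torque/PBS, LSF and Condor print different output.

```python
from typing import Callable, Dict

# Hypothetical normalized job states; the real BLAH state names are not given here.
RUNNING, IDLE, DONE, UNKNOWN = "RUNNING", "IDLE", "DONE", "UNKNOWN"

# In this sketch a "plugin" is a function that turns the text printed by a
# batch system's status command into {job_id: normalized_state}.
StatusPlugin = Callable[[str], Dict[str, str]]


def parse_two_column_status(output: str) -> Dict[str, str]:
    """Parse an invented 'job_id state_letter' format, one job per line."""
    letter_to_state = {"R": RUNNING, "Q": IDLE, "C": DONE}
    states: Dict[str, str] = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        job_id, letter = fields
        states[job_id] = letter_to_state.get(letter, UNKNOWN)
    return states


class StatusPoller:
    """Asks the batch system for job states instead of reading its log files."""

    def __init__(self, run_status_command: Callable[[], str], plugin: StatusPlugin):
        self._run = run_status_command  # e.g. a wrapper around subprocess
        self._plugin = plugin           # batch-system-specific parser
        self._last: Dict[str, str] = {}

    def poll(self) -> Dict[str, str]:
        """Return only the jobs whose state changed since the previous poll."""
        current = self._plugin(self._run())
        changed = {job: state for job, state in current.items()
                   if self._last.get(job) != state}
        self._last = current
        return changed
```

In such a design, supporting a new batch system would mean writing one more parser function, while the polling logic stays unchanged. Whether the actual BLParser is structured this way is not stated on this page.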

Some issues:

  • The CREAM client software (CREAM CLI and ICE) requires VOMS 1.8 (which was released about 2 months ago). The issue is that the WMS now in certification is still using VOMS 1.7.
  • Concerning the CREAM CLI (task 4.4), it must be decided whether it is to be included in the "standard" WMS UI, or whether a specific CREAM UI profile has to be created.

Summary of next steps (as of 14/1/08)

  • 17-Jan-2008: new ICE client is finalized
  • 21-Jan-2008: yaim based installation procedure updated
  • 17-Jan-2008: ICE and CREAM ready for stress tests (done by developers)
  • 31-Jan-2008: ICE and CREAM ready for stress tests done by independent testers
  • 10-Feb-2008: ICE and CREAM ready for certification

Detailed Check List

These notes describe the check list for a release candidate CE.

Nr Task Who Priority Needed when Verified Status Details ETA
1. Installation
1.1 Package dependencies defined Luigi (JRA1) - Sara, Simone (SA1) 10 before first certification No Done according to development team. To be confirmed by SA3    
1.2 No redundant packages Luigi (JRA1) - Sara, Simone (SA1) 5 a.s.p. No Done according to development team. To be confirmed by SA3   -
1.3 Common packages, including external packages, versions should be consistent with other node types (JRA1) - Sara, Simone (SA1) 4 a.s.p. No Done according to development team. To be confirmed by SA3   -
1.4 The file locations should follow the standard convention Luigi, Alvise (JRA1) 6 before first rollout No

Done according to development team (logs moved to /opt/glite/var/log).

To be confirmed by SA3

   
1.5 Build on ETICS for SL4 with VDT-1.6 (JRA1) 10 before first rollout OK Done   -
2. Configuration
2.1 YAIM will be used and should be compatible with the component centric YAIM architecture and only configure what is needed Sara, Simone, Cristina (SA1) 10 before first rollout No

Done according to development team.

To be verified by SA3

   
3. Documents
3.1 Release notes Luigi, Alvise (JRA1) 10 before PPS No Done according to development team. To be confirmed by SA3 CREAM release notes published at: http://grid.pd.infn.it/cream/field.php?n=Main.ReleaseNotes and updated whenever a new version is released -
3.2 User guide for the clients (JRA1) - (SA3) 8 before PPS No Done according to development team. To be confirmed by SA3 For submissions to CREAM via WMS no specific guide is needed (i.e. the WMS guide is the proper documentation) since knowing the CE type is not important. For direct submissions to CREAM (i.e. bypassing the WMS) a CREAM user guide along with a CREAM JDL guide is available on the CREAM web site (http://grid.pd.infn.it/grid).  
3.3 Basic guide for operations covering the different deployment scenarios (SA3) - (SA1) - (JRA1) 8 before prod No In progress Besides the documentation for the yaim based installation and configuration, some documentation targeted to sysadmins is available in the CREAM web site (http://grid.pd.infn.it/cream) under "Administrator Guides". This is being augmented -
4. Functionality
4.1 Accounting system, APEL has to work Alessio, Elisabetta (SA3) 10 before PPS No In progress This was tested for LSF. The records get properly accounted, but it looks like there is a bug in APEL (#30041). Tests to be done for Torque.  
4.2 Information system, BDII will be used and should be able to publish VO tag (gridftp server is needed) and other runtime environment, correctly publish static and dynamic information using glue schema (version >= 1.3), sanity check Cristina, Sara, Simone (SA1) 10 before PPS No Done according to development team. To be confirmed by SA3 There isn't anything specific to CREAM. It is exactly the same stuff used in LCG CE and gLite CE. Done when Task 2.1 is done  
4.3 Security, proxy with VOMS extension has to be supported, CRL update Luigi (JRA1) 9 before PPS No Done according to development team. To be confirmed by SA3 - -
4.4 Job submission through WMS and CLI on UI Luigi (JRA1) 9 before PPS Yes Done Job submission to CREAM is already possible via the WMS and also by interacting directly with CREAM (i.e. bypassing the WMS). An "official" CREAM CLI exists -
4.5 Job submission through Condor-G Massimo, Francesco, Luigi (JRA1) - Condor 7 later No In progress

Some work was done

Need to re-contact Condor guys since the CREAM interface had to be changed

 
4.6 Batch system support, start with torque and LSF, Condor and SGE later Alessio, Elisabetta, Mezzadri, Prelz (SA3) - Luigi (JRA1) 8 before PPS No In progress

The interaction with the batch system is fully managed by BLAH, which already supports Torque/PBS and LSF (submission to these batch systems via CREAM has been verified). The BLAH BLParser has been reimplemented, also to facilitate the porting to new batch systems. This modification required some changes in the CREAM code as well. A first implementation of this new BLAH BLParser supporting Condor has been done. Basic tests have been done at PIC (submissions via WMS and via CREAM-CLI) and it seems to be working (the only problem seen so far is that the ReallyRunning event is not logged by the LRMS: to be investigated).

PIC people are going to do more tests.

Once the new BLAH model proves to be reliable, it will also be used for LSF and PBS.

-
4.7 Support passing parameters to the batch systems Luigi, Alvise (JRA1) - Elisabetta (SA3) 7 later No Done according to development team. To be confirmed by SA3 CREAM implements this feature via BLAH, in the same way as the gLite CE. The JDL 'Requirements' attributes listed as 'CeForwardParameters' in the WMS configuration file are forwarded to BLAH (as 'CERequirements' in the classad sent to BLAH). The "local" scripts, invoked by the BLAH submission scripts, then have to be properly customized by the local sysadmin. This is explained in patch https://savannah.cern.ch/patch/?func=detailitem&item_id=1044 and in https://twiki.cern.ch/twiki/bin/view/EGEE/INFN_Test_Results. For direct submissions to the CREAM CE, the CREAM JDL 'CERequirements' attribute can be used, as documented in the CREAM JDL guide  
4.8 Support stdout and stderr monitoring Luigi, Paolo (JRA1) 5 later No Done according to development team. To be confirmed by SA3 Supported via 'Job perusal', for jobs submitted to CREAM via WMS and also directly from UI  
4.9 Support MPI Luigi, Paolo (JRA1) - Barbera (NA4) 5 later No Done according to development team. To be confirmed by SA3 MPI jobs supported for jobs submitted to CREAM via WMS and also directly from UI. Implemented the new functionality requested by the MPI WG of TCG  
4.10 Proxy renewal Alvise, Moreno, Luigi (JRA1) Alessio, Elisabetta (SA3) 10 before PPS no In progress

Done

Known issue:

from time to time BLAH reports that the proxy renewal operation was successfully done, while the proxy was not actually renewed.

 
4.11 Support more than 5000 simultaneous jobs, less than 0.5% jobs fail due to CE (JRA1) 9 before PPS yes In progress

This was demonstrated in the CREAM tests done in the summer (see the test results).

Being re-tested with the redesigned CREAM-ICE

-
5. Operations
5.1 Port list (JRA1) - (SA3) 10 before certification no Done according to development team. To be confirmed by SA3 List published in http://grid.pd.infn.it/cream/field.php?n=Main.PortsUsedInACREAMCE and communicated to John White for its inclusion in org.glite.site-info.ports/doc/middleware-ports.txt 24/09/2007
5.2 Long time unattended running, more than 5 days, eventually extend to 1 month (JRA1) - (SA3) 8 later No In progress To be tested -
5.3 Logfile rotation Sara, Simone, Cristina (SA1) 7 before prod No Done according to development team. To be confirmed by SA3 CREAM and CEMon log file rotation implemented via log4j. For the other log files (glexec, blah) log rotation implemented within YAIM  
5.4 Audit trace management Luigi (JRA1) 10 before PPS No Done according to development team. To be confirmed by SA3 All the accesses are properly logged in the CREAM and glexec log files (the verbosity can be tuned)  
5.5 All services should be up after rebooting, and less than 0.5% jobs lost Paolo (JRA1) 6 later no Blocked This was already demonstrated during the summer tests: with a restart of the service very few jobs got lost.
However there is a known issue happening just after the restart of the service (bug #22437). The new voms (1.8) software is supposed to address this issue. Its integration requires some changes in the CREAM code (being done), but first the integration should be done on util-java, authz-framework and delegation-java
 
5.6 Clean up pool accounts for dynamic mapping Sara, Simone, Cristina (SA1) - (JRA1) 10 before prod No Done according to development team. To be confirmed by SA3 Done by lcg-expiregridmapdir cron job -
5.7 Clean up obsolete and temporary files, specially the files under the home directories of pool accounts Alessio, Elisabetta (SA3) 5 before prod No Done according to development team. To be confirmed by SA3 Done by cleanup-grid-accounts cron job -
5.8 SAM monitoring integration Sara, Simone, Cristina (SA3) 8 later no in progress Need to contact SAM people to understand in detail what has to be done (e.g. are there some templates that can be considered ?). This will start when task 2.1 is done  
5.9 Verify that no serious memory leaks are present Alvise (JRA1) 9 before prod No In progress

CREAM and ICE seem OK. There is a memory leak in classad.jar. The Condor people have been pinged many times for a new jclassad with this leak fixed. For the time being, classad.jar has to be replaced with a patched one as a post-install task.

Serious leaks in ICE fixed, but some are still there.

Implemented suicidal patch (under tests)
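
The reliability criteria in tasks 4.11 and 5.5 (less than 0.5% of jobs failed or lost) translate into a simple pass/fail check on test results. The sketch below is illustrative only: the function name is invented, and it assumes the threshold is meant as a strict "less than", as worded in the checklist. Exact fractions are used so that the boundary case is not affected by floating-point rounding.

```python
from fractions import Fraction

# 0.5% expressed exactly, as stated in tasks 4.11 and 5.5.
MAX_FAILURE_RATE = Fraction(5, 1000)


def meets_failure_threshold(total_jobs: int, failed_jobs: int,
                            max_rate: Fraction = MAX_FAILURE_RATE) -> bool:
    """Return True if strictly less than max_rate of the jobs failed."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    if not 0 <= failed_jobs <= total_jobs:
        raise ValueError("failed_jobs must be between 0 and total_jobs")
    return Fraction(failed_jobs, total_jobs) < max_rate
```

For a run of 5000 jobs, for example, 24 failures would pass this check, while 25 failures (exactly 0.5%) would not.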

 

Test results for LCG-CE on SL4: LCG-CE

Test results for gLite-CE on SL3: gLite-CE SL3

Test results for cream on SL3: cream SL3

-- Main.markusw - 09 Aug 2007 -- DiQing - 09 Aug 2007

Topic revision: r43 - 2008-03-10 - MassimoSgaravatto