These notes describe the checklist for a release candidate of the CREAM CE.
Nr |
Task |
Who |
Priority |
Needed when |
Verified |
Status |
Details |
ETA |
1. Installation |
1.1 |
Package dependencies defined |
Luigi (JRA1) - Sara, Simone (SA1) |
10 |
before first certification |
In progress |
In progress |
Done, but being checked with task 2.1 |
|
1.2 |
No redundant packages |
Luigi (JRA1) - Sara, Simone (SA1) |
5 |
a.s.a.p. |
yes |
OK |
configurations have been updated; waiting for tests and final review |
- |
1.3 |
Versions of common packages, including external packages, should be consistent with other node types |
(JRA1) - Sara, Simone (SA1) |
4 |
a.s.a.p. |
no |
OK |
list available to be validated by Integration Team |
- |
1.4 |
The file locations should follow the standard convention |
Luigi, Alvise (JRA1) |
6 |
before first rollout |
no |
in progress |
verifying standard conventions in the developers guide |
24/09/2007 |
1.5 |
Build on ETICS for SL4 with VDT-1.6 |
(JRA1) |
10 |
before first rollout |
OK |
Done |
As of Nov. 12, 2007, all *.ce modules and ICE build properly. See the ETICS build reports |
- |
2. Configuration |
2.1 |
YAIM will be used; it should be compatible with the component-centric YAIM architecture and configure only what is needed |
Sara, Simone, Cristina (SA1) |
10 |
before first rollout |
no |
In progress |
CREAM now has its own YAIM CVS module (glite-yaim-cream-ce). Some tests have been started (by Di) and some problems were found; these are being addressed. Sara is going to CERN to speed up the process |
03/10/2007 |
3. Documents |
3.1 |
Release notes |
Luigi, Alvise (JRA1) |
10 |
before PPS |
no |
Done according to development team. To be confirmed by SA3 |
CREAM release notes published at: http://grid.pd.infn.it/cream/field.php?n=Main.ReleaseNotes and updated whenever a new version is released |
- |
3.2 |
User guide for the clients |
(JRA1) - (SA3) |
8 |
before PPS |
no |
Done according to development team. To be confirmed by SA3 |
For submissions to CREAM via the WMS no specific guide is needed (i.e. the WMS guide is the proper documentation), since knowing the CE type is not important. For direct submissions to CREAM (i.e. bypassing the WMS), a CREAM user guide and a CREAM JDL guide are available on the CREAM web site (http://grid.pd.infn.it/grid). |
30/09/2007 |
3.3 |
Basic guide for operations covering the different deployment scenarios |
(SA3) - (SA1) - (JRA1) |
8 |
before prod |
no |
In progress |
Besides the documentation for the YAIM-based installation and configuration, some documentation targeted at sysadmins is available on the CREAM web site (http://grid.pd.infn.it/cream) under "Administrator Guides". This is being extended |
- |
4. Functionality |
4.1 |
Accounting system, APEL has to work |
Alessio, Elisabetta (SA3) |
10 |
before PPS |
no |
In progress |
This was tested for LSF. The records get properly accounted, but it looks like there is a bug in APEL (#30041). Tests to be done for Torque. |
27/09/2007 |
4.2 |
Information system: BDII will be used and should be able to publish VO tags (a gridftp server is needed) and other runtime environment information, and to correctly publish static and dynamic information using the GLUE schema (version >= 1.3); sanity check |
Cristina, Sara, Simone (SA1) |
10 |
before PPS |
no |
In progress |
There isn't anything specific to CREAM. It is exactly the same software used in the LCG CE and the gLite CE. Done when task 2.1 is done |
01/10/2007 |
4.3 |
Security, proxy with VOMS extension has to be supported, CRL update |
Luigi (JRA1) |
9 |
before PPS |
yes |
OK |
- |
- |
4.4 |
Job submission through WMS and CLI on UI |
Luigi (JRA1) |
9 |
before PPS |
yes |
OK |
Job submission to CREAM is already possible via the WMS and also by interacting directly with CREAM (i.e. bypassing the WMS). An "official" CREAM CLI exists |
- |
4.5 |
Job submission through Condor-G |
Massimo, Francesco, Luigi (JRA1) - Condor |
7 |
later |
no |
in progress |
the integration of CREAM and Condor-G has already started; some simple jobs have been correctly submitted to CREAM (problem with output sandbox transferring); basic Condor-G->CREAM operations implemented (to be tested); CEMon integration for async notification of job status changes (to be done) |
31/10/2007 |
4.6 |
Batch system support: start with Torque and LSF, Condor and SGE later |
Alessio, Elisabetta, Mezzadri, Prelz (SA3) - Luigi (JRA1) |
8 |
before PPS |
no |
in progress |
The interaction with the batch system is fully managed by BLAH, which already supports Torque/PBS and LSF (submissions to these batch systems via CREAM have been verified). The BLAH BLParser is being reimplemented, also to facilitate porting to new batch systems. This modification will require some changes in the CREAM code as well. A first implementation of the new BLAH BLParser supporting Condor is expected by the end of November. The teams responsible for Condor and SGE support have been informed that customizing the current implementation of the code doesn't make much sense since, as said above, the BLAH BLParser is being redesigned |
- |
4.7 |
Support passing parameters to the batch systems |
Luigi, Alvise (JRA1) - Elisabetta (SA3) |
7 |
later |
no |
in progress |
CREAM implements this feature through BLAH; the necessary RPMs were installed as described in patch https://savannah.cern.ch/patch/?func=detailitem&item_id=1044, which was made for the gLite CE |
11/10/2007 |
4.8 |
Support stdout and stderr monitoring |
Luigi, Paolo (JRA1) |
5 |
later |
no |
OK |
Job perusal works as expected. |
24/09/2007 |
4.9 |
Support MPI |
Luigi, Paolo (JRA1) - Barbera (NA4) |
5 |
later |
no |
OK |
to be verified |
24/09/2007 |
4.10 |
Proxy renewal |
Alvise, Moreno, Luigi (JRA1) Alessio, Elisabetta (SA3) |
10 |
before PPS |
no |
in progress |
In the current implementation of CREAM/ICE, proxy renewal is implemented, but there are known problems occurring when the load on the system is high. This is being addressed now. It requires a code redesign in both ICE and CREAM (e.g. in CREAM a DB will be used for the backend). This work will also improve the scalability and efficiency of the system, but is taking longer than originally expected |
19/10/2007 |
4.11 |
Support more than 5000 simultaneous jobs, less than 0.5% jobs fail due to CE |
(JRA1) |
9 |
before PPS |
yes |
OK |
This was demonstrated in the CREAM tests done in the summer (see the test results). It will have to be demonstrated again when the ongoing redesign of the system (see task 4.10) is done |
- |
5. Operations |
5.1 |
Port list |
(JRA1) - (SA3) |
10 |
before certification |
no |
Done according to development team. To be confirmed by SA3 |
List published in http://grid.pd.infn.it/cream/field.php?n=Main.PortsUsedInACREAMCE and communicated to John White for its inclusion in org.glite.site-info.ports/doc/middleware-ports.txt |
24/09/2007 |
5.2 |
Long unattended running: more than 5 days, eventually extended to 1 month |
(JRA1) - (SA3) |
8 |
later |
no |
In progress |
To be done when the new software (see task 4.10) is ready |
- |
5.3 |
Logfile rotation |
Sara, Simone, Cristina (SA1) |
7 |
before prod |
no |
In progress |
Done by YAIM; glexec and CREAM logs are rotated; BLParser log rotation to be tested |
24/09/2007 |
5.4 |
Audit trace management |
Luigi (JRA1) |
10 |
before PPS |
no |
Done according to development team. To be confirmed by SA3 |
All the accesses are properly logged in the CREAM and glexec log files (the verbosity can be tuned) |
10/10/2007 |
5.5 |
All services should be up after rebooting, and less than 0.5% jobs lost |
Paolo (JRA1) |
6 |
later |
no |
blocked |
The first connection after start-up still fails; waiting for feedback from MSWG |
1 week since unblocked |
5.6 |
Clean up pool accounts for dynamic mapping |
Sara, Simone, Cristina (SA1) - (JRA1) |
10 |
before prod |
No |
Done according to development team. To be confirmed by SA3 |
Done by lcg-expiregridmapdir cron job |
- |
5.7 |
Clean up obsolete and temporary files, especially the files under the home directories of pool accounts |
Alessio, Elisabetta (SA3) |
5 |
before prod |
no |
OK |
done by /etc/cron.d/cleanup-grid-accounts; to be verified |
- |
5.8 |
SAM monitoring integration |
Sara, Simone, Cristina (SA3) |
8 |
later |
no |
in progress |
a contact from SA3 at CERN needs to be set up |
unknown |
5.9 |
Verify that no serious memory leaks are present |
Alvise (JRA1) |
9 |
before prod |
no |
In progress |
CREAM seems OK. Memory leaks in ICE are likely due to leaks in the Globus and gridsite libraries. A temporary fix is to implement the “harakiri patch” (more or less the “suicidal patch” used in WMProxy). ICE memory usage is being reduced as well
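
Tasks 4.11 and 5.5 both state a numeric acceptance criterion (more than 5000 simultaneous jobs, fewer than 0.5% of jobs failed or lost because of the CE). The following is a minimal sketch of how such a criterion could be checked against the counts from a test run. The names StressRun, meets_failure_limit and meets_task_4_11 are hypothetical and are not part of CREAM or of its test suite; only the two thresholds come from the checklist.

```python
from dataclasses import dataclass

# Thresholds copied from tasks 4.11 and 5.5 of the checklist.
MIN_SIMULTANEOUS_JOBS = 5000   # "more than 5000 simultaneous jobs"
MAX_FAILURES_PER_1000 = 5      # "less than 0.5%" == fewer than 5 per 1000


@dataclass
class StressRun:
    """Hypothetical summary of one test run (not a CREAM data structure)."""
    simultaneous_jobs: int   # jobs handled by the CE at the same time
    failed_due_to_ce: int    # jobs that failed or were lost because of the CE


def meets_failure_limit(run: StressRun) -> bool:
    """True if strictly fewer than 0.5% of the jobs failed due to the CE."""
    if run.simultaneous_jobs <= 0:
        raise ValueError("a run needs at least one job")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return run.failed_due_to_ce * 1000 < run.simultaneous_jobs * MAX_FAILURES_PER_1000


def meets_task_4_11(run: StressRun) -> bool:
    """True if the run satisfies both the load and the failure-rate condition."""
    return run.simultaneous_jobs > MIN_SIMULTANEOUS_JOBS and meets_failure_limit(run)


if __name__ == "__main__":
    run = StressRun(simultaneous_jobs=6000, failed_due_to_ce=20)
    print("task 4.11 satisfied:", meets_task_4_11(run))
```

The comparison is strict on both sides, matching the wording "more than" and "less than" in the task descriptions; a run with exactly 0.5% failures is therefore rejected.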
|