Cream CE Pilot Service: Description and Status.
- Start Date: 10 Jun 2008
- End Date (tentative): 17 Mar 2009
- Description: Pilot Service of Cream CE
- Coordinator: Antonio Retico
- Contact e-mail: egee-pps-pilot-cream@cern.ch
- Status : In progress
- Related meetings
Links
Criteria for transition from lcg-CE to CREAM
Description
JRA1, SA3 and SA1 are running a pilot service focused on the new Cream CE in order to collect feedback from the experiments and to accelerate the testing and deployment in production of the new service.
The pilot will be organised in two phases:
- 1st phase: Some of the PPS sites will be gradually requested to replace their lcg-CE with CREAM. We will start with one site, published in the PPS BDII, and then extend the testbed as needed. The aim of this phase is to fine-tune the installation tools (YAIM and release notes) and to verify the correct interaction of the new services with the monitoring tools. In addition, one WMS in PPS will need to be adapted to submit to CREAM CEs. So, initially, up to two PPS sites will be needed to support this scenario, growing later to a few more (ideally one per batch system).
- 2nd phase: to start as soon as the installation is stable and the service has been demonstrated to be working and interacting correctly with the other components. Some production sites will be asked to add/replace one or more CREAM CEs, to be published with GlueServiceStatus = 'production'. The LHC experiments will be involved in this phase to start a controlled submission of production jobs to the new service.
It is important to point out that this activity is by no means meant to replace the standard certification of the service. The certification will be carried out in parallel in the usual way and in close synergy with the pilot, so that ideally both environments will profit from each other's findings. Site administrators involved will have
- to react promptly to possible issues found
- to keep in touch with JRA1 people
- to apply the fixes they provide
- to communicate and keep track of them
An additional 3rd phase was planned on the 1st of April.
- 3rd phase: it comprises two activities: a) WMS-ICE validation; b) performance/acceptance tests for CREAM.
Overall Planning
- Initial plan for phase 1:
- Initial roadmap for the test in PPS:
- Set-up of Cream CE on torque at PPS-CNAF (eventually replacing cert-ce-03.cnaf.infn.it) and FZK-PPS (timeline: 1 week)
- Enabling ICE at SCAI-PPS (or FZK-PPS as back-up) (timeline: 1 week, in parallel with the CREAM CE set-up above)
- Verification/fixing of SAM monitoring chain in PPS (the SAM client at PPS-RAL should switch to use the CREAM-enabled WMS) (timeline: 1.5 weeks, starting from first successful job submission)
- Extension of the tests to other supported batch systems/platforms. IN2P3-CC-PPS, PIC (for LSF) and possibly some other sites could get involved here; to be confirmed.
- Getting ready for phase 2 (PIC?, CNAF?, IN2P3?)
The named CE machines will be exempted from applying the standard PPS updates for the whole duration of the pilot. The WMS will instead need special care, because the non-standard extra configuration will have to be maintained throughout possible future service upgrades.
- Phase2 - Initial Planning:
Phase2 started on the 1st of October 2008. It is focused on the performance of the ICE WMS.
The objective of Phase2 is to enable CMS users to submit continuously at a rate of 10K jobs/day over 5 weeks.
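As a rough sizing of this objective, the sketch below converts the target rate into a total job count and a per-minute rate. It assumes that submission is spread evenly over seven days a week and 24 hours a day, which the plan does not state; the function names are illustrative only.

```python
# Targets taken from the text: 10K jobs/day sustained over 5 weeks.
JOBS_PER_DAY = 10_000
WEEKS = 5


def total_jobs(jobs_per_day: int, weeks: int) -> int:
    """Total jobs over the period, assuming submission on all 7 days of each week."""
    return jobs_per_day * weeks * 7


def jobs_per_minute(jobs_per_day: int) -> float:
    """Average rate per minute, assuming submission is spread evenly over 24 hours."""
    return jobs_per_day / (24 * 60)


if __name__ == "__main__":
    print(total_jobs(JOBS_PER_DAY, WEEKS))          # 350000
    print(round(jobs_per_minute(JOBS_PER_DAY), 2))  # 6.94
```

Under these assumptions the objective corresponds to about 350,000 jobs in total, at an average of roughly 7 jobs per minute.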
The proposed layout for the pilot is:
- 2 WMSs: CNAF and (FZK or SCAI)
- 1 UI CNAF: cert-ui-01.cnaf.infn.it
- 1 BDII CNAF: including services in pilot + LCG production
- 1 VOBOX (if needed): FZK
- CREAM CEs: FZK (Angela Poschlad), Padova (14 CEs: 7 PBS, 7 LSF; resp. Sara Bertocco), Bari (Giacinto Donvito), CNAF (~10 CEs; resp. Daniele Cesini), SCAI (Klare Cassirer?)
The CREAM CEs will access production batch systems.
Requirements for the software to be deployed in the pilot:
- CVS tag defined and yum repository (at CNAF) available
- SW documented in patches in state >= "With provider"
- SW accompanied by a "certification of deployability" delivered by SA3 Italy (Alessio Gianelle)
Methodology:
- updates of the SW at sites, if needed, will be applied by SA1 personnel on demand of the developers
- patches "approved" by the pilot will be set to "ready for certification"
Description of the patches to be expected (provided off-line by Massimo):
- a) new version of ICE to install over a WMS with PATCH:1841
- b) new version of yaim-wms for possible modifications in configuration
- c) a patch of CREAM/CEMon/BLAH (rpms glite-ce* to be installed on CREAM CE)
- d) a patch of yaim-cream-ce
- e) a patch of CREAM-CLI
The patches in c), d) and e) could be released earlier to the standard certification track to fix possible issues in the CREAM now in production.
- Initial roadmap:
- Prepare formal tasks for SA1 participants (timeline: ~3days)
- Integrate three new sites in the pilot framework (timeline: ~4 days, in parallel with the preparation of the formal tasks)
- Prepare patches (timeline: ~4 days, in parallel with the preparation of the formal tasks)
- Upgrade of the sites in the pilot (timeline:~1 week)
- Open to users
Phase 2 closed on the 31st of March 2009 after the indication of a set of patches to be deployed ASAP in the production system as a pre-condition for phase 3 Activity a) (see PPIslandFollowUp2009x04x01). This issue is followed up with GGUS:47489.
Phase three, launched during the meeting of the 1st of April, is composed of two activities:
- Activity a) WMS-ICE validation; this is the main activity.
- Activity b) performance/acceptance tests for CREAM.
Activity a)
Activity b)
Considering the points mentioned in the transition plan, and the outcomes of the pilot meeting of the 1st of April, SA3 will focus on a subset of points, initially using direct submission to the CREAM CE. The points to verify are:
J
"At least 5000 simultaneous jobs per CE node". "5000 simultaneous jobs" means that the CE should be able to handle 5000 jobs submitted simultaneously to the CE. The CE should be able to handle them all the way (sending them to the batch system, polling for the status, etc.). But it doesn't mean that all the 5000 jobs will be run at the same time. They probably won't (it depends on how busy the batch system/worker nodes are).
So, as long as the 5000 simultaneously submitted jobs are properly treated by the CE, the test is passed.
K
"Unlimited number of user/role/submission node combinations from many VO's (at least 50), up to the limit of the number of jobs supported on a CE node". This test will probably be done in a small and controlled setup (non production) using a fake CA as is usually done by the certification team.
M
"Job failures due to restart of CE services or reboot < 0.1%".
O
"Graceful failure or self-limiting behavior when the CE load reaches its maximum (e.g. if a CE node can support only 5000 jobs it must not crash or become unresponsive with more than that)".
At a later stage, with the use of a WMS from the pilot, these points can be evaluated:
D.i
"The ICE / CREAM job submission chain should be able to meet all performance criteria and otherwise perform at least as well as the WMS / LCG CE submission chain". This point depends on BUG:47911.
D.ii
"The ICE-WMS must deal gracefully with large peaks in the rate of jobs submitted to it.", although an agreement must be reached on what is considered as 'gracefully'.
Testing point J ("At least 5000 simultaneous jobs per CE node")
This section describes the plan for executing the tests aimed at the verification of point J of the transition plan.
The following table shows the fixed parameters at each target site and link to the page used to collect results.
* The submission rate parameter depends on other variables; it still needs to be clarified which parameter could be fixed.
* MySQL host (yes/no) indicates whether or not the site has a separate MySQL host.
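The quantitative points J and M of the transition plan can be written down as simple pass/fail checks. The following is a minimal sketch, not the test suite used by SA3: the function names and the way results are counted (jobs submitted, jobs handled all the way by the CE, jobs failed because of a restart) are assumptions made for illustration.

```python
def point_j_passed(submitted: int, handled_by_ce: int, required: int = 5000) -> bool:
    """Point J: at least `required` jobs submitted simultaneously, all handled by the CE.

    "Handled" means the CE treated the job all the way (sent it to the batch
    system, polled its status, etc.); the jobs need not run at the same time.
    """
    return submitted >= required and handled_by_ce == submitted


def point_m_passed(failed_on_restart: int, total_jobs: int) -> bool:
    """Point M: job failures due to CE restart or reboot strictly below 0.1%."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    # Integer form of failed / total < 0.001, avoiding floating-point rounding.
    return failed_on_restart * 1000 < total_jobs
```

For a run of 5000 jobs, point M as written tolerates at most 4 restart-related failures, since 5 failures is exactly 0.1%.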
Technical documentation
Installation of Cream CE
To install a CREAM CE please follow the instructions reported at:
http://igrelease.forge.cnaf.infn.it/doku.php?id=doc:guides:devel:install-cream31-devel
Some remarks:
- In the instructions, in the "YUM repository setup" section, three possible options are listed. Please refer to the third one (for the CREAM PPS pilot related activities).
- Please check for (and, if present, remove) other CREAM-related repo files in /etc/yum.repos.d (e.g. coming from older installations).
- In the instructions, you will note that sometimes different actions must be taken depending on whether you are using RPMs v. 1.8.x (patch #1755) or RPMs later than v. 1.8.x. We are in the latter scenario (i.e. RPMs later than v. 1.8.x).
- Please publish TestbedB as the value of the GlueCeStateStatus attribute. This is done, as explained in Appendix A, via the variable CREAM_CE_STATE, to be specified in <your-siteinfo.def-dir>/services/glite-creamce
Please then publish your CREAM CEs in a dedicated CREAM site-bdii (you can host this CREAM site-bdii on one of the CREAM CEs). Please choose a new (i.e. non-existing) basedn for this CREAM site-bdii.
Installation of ICE WMS (Phase2)
To install an ICE-enabled WMS, please follow the usual instructions reported here, referring to the installation of "glite-WMS".
However there are some differences to consider with respect to the official guide:
- The middleware repository (see "The middleware repositories" section): besides the production one, please also copy this file into /etc/yum.repos.d
yum Repositories for Phase2
Here are the instructions to use them:
SLC4-native-compiled i386 glite-CREAM_ce 3.1:
It is a yum repository:
- create the file /etc/yum.repos.d/glite-CREAM_ce_i386.repo containing the following lines:
[glite-CREAM_ce_i386]
name= glite-CREAM_ce i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-CREAM_CE/sl4/i386/
enabled=1
SLC4-native-compiled i386 ICE enabled glite-WMS_ICE 3.1:
It is a yum repository:
- create the file /etc/yum.repos.d/glite-WMS_ICE_i386.repo containing the following lines:
[glite-WMS_ICE_i386]
name= glite-WMS_ICE i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-WMS_ICE/sl4/i386/
enabled=1
SLC4-native-compiled i386 ICE enabled glite-LB 3.1:
It is a yum repository:
- create the file /etc/yum.repos.d/glite-LB_i386.repo containing the following lines:
[glite-LB_i386]
name= glite-LB i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-LB/sl4/i386/
enabled=1
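The three repository stanzas above share the same layout, so they can also be generated rather than typed by hand. The sketch below uses Python's standard configparser module to render one stanza; the helper name repo_stanza and the PILOT_BASE constant are illustrative, and the output should be checked against the listings above before being copied to /etc/yum.repos.d.

```python
import configparser
import io

# Common prefix of the three baseurl values listed above.
PILOT_BASE = "http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE"


def repo_stanza(repo_id: str, name: str, product_dir: str) -> str:
    """Return the text of a yum .repo file for one pilot repository (sl4, i386)."""
    parser = configparser.RawConfigParser()
    parser[repo_id] = {
        "name": name,
        "baseurl": f"{PILOT_BASE}/{product_dir}/sl4/i386/",
        "enabled": "1",
    }
    buffer = io.StringIO()
    # Write "key=value" without spaces around the delimiter, as in the listings.
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(repo_stanza("glite-LB_i386", "glite-LB i386", "glite-LB"))
```

Note that for the CREAM CE repository the directory name in the URL (glite-CREAM_CE) differs in case from the section name (glite-CREAM_ce_i386), which is why the two are passed separately.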
Yum repositories for Phase 3
Pilot Layout
Phase1 (closed in September)
- CNAF: cert-ce-03.cnaf.infn.it + 4 virtual WNs using pbs, 7 queues (alice, atlas, cms, lhcb, ops, dteam, pps)
- FZK: pps-cream-fzk.gridka.de
- FZK: pps-rb-fzk.gridka.de
- SCAI: glite-wms2.scai.fraunhofer.de
- CNAF: cert-ui-01.cnaf.infn.it
- FZK: pps-vobox-fzk.gridka.de (alice prod setup)
Phase2
UI
WMS + ICE
- cert-rb-01.cnaf.infn.it (CMS-CRAB test) since 11-Mar-09 this WMS points to the production BDII
- glite-wms2.scai.fraunhofer.de
Cream CEs
LSF Cream
INFN-PADOVA
- cream-10 SL4, batch master LSF
- cream-21 SL4, CE with LSF
- cream-22 SL4, CE with LSF
- cream-23 SL4, CE with LSF
- cream-24 SL4, CE with LSF
- cream-25 SL4, CE with LSF
- cream-26 SL4, CE with LSF
- cream-27 SL4, CE with LSF - site BDII
- prod-wn-001 SL4, WN with LSF
- prod-wn-002 SL4, WN with LSF
- prod-wn-003 SL4, WN with LSF
- prod-wn-004 SL4, WN with LSF
- prod-wn-005 SL4, WN with LSF
PBS Cream
INFN-PADOVA
- cream-28 SL4, CE with pbs - batch master
- cream-29 SL4, CE with pbs
- cream-30 SL4, CE with pbs
- cream-31 SL4, CE with pbs
- cream-32 SL4, CE with pbs
- cream-33 SL4, CE with pbs
- cream-34 SL4, CE with pbs - site BDII
- prod-wn-006 SL4, WN with PBS
- prod-wn-007 SL4, WN with PBS
- prod-wn-008 SL4, WN with PBS
- prod-wn-009 SL4, WN with PBS
- prod-wn-010 SL4, WN with PBS
CNAF LSF CREAM CE CLUSTER
- cert-04.cnaf.infn.it SL4 LSF CE
- cert-05.cnaf.infn.it SL4 LSF CE
- cert-06.cnaf.infn.it SL4 LSF CE
- cert-07.cnaf.infn.it SL4 LSF CE
- cert-08.cnaf.infn.it SL4 LSF CE
- cert-09.cnaf.infn.it SL4 LSF CE
- cert-13.cnaf.infn.it SL4 LSF CE
CNAF PBS CREAM CE CLUSTER
- cert-ce-03.cnaf.infn.it SL4 PBS CE (CMS-CRAB test)
Site BDII:
- ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid
- ldap://cream-34.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST-PBS,o=grid
- INFN-BARI-CREAMTEST ldap://cream-ce-1.ba.infn.it:2170/mds-vo-name=INFN-BARI-CREAMTEST,o=grid
- INFN-CNAF-CREAM ldap://cert-07.cnaf.infn.it:2170/mds-vo-name=INFN-CNAF-CREAM,o=grid
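Each site BDII endpoint above is an LDAP URL made of a host, a port and a base DN. As an illustration, the sketch below splits such a URL with Python's standard library and builds the argument list for an anonymous query with the OpenLDAP ldapsearch client. The use of ldapsearch and the function names are assumptions, not something prescribed by the pilot.

```python
from urllib.parse import urlsplit


def split_bdii_url(url: str) -> tuple[str, int, str]:
    """Split an ldap://host:port/base-dn URL into (host, port, base DN)."""
    parts = urlsplit(url)
    if parts.scheme != "ldap" or parts.hostname is None or parts.port is None:
        raise ValueError(f"not a complete ldap:// URL: {url!r}")
    return parts.hostname, parts.port, parts.path.lstrip("/")


def ldapsearch_args(url: str) -> list[str]:
    """Build an ldapsearch command line for the given site BDII URL."""
    host, port, base_dn = split_bdii_url(url)
    # -x: simple (anonymous) bind, -H: server URI, -b: search base
    return ["ldapsearch", "-x", "-H", f"ldap://{host}:{port}", "-b", base_dn]


if __name__ == "__main__":
    url = "ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid"
    print(" ".join(ldapsearch_args(url)))
```

The resulting list can be passed to subprocess.run on a host where the ldapsearch client is installed.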
Phase3
Activity a)
Activity b)
This list has to be verified:
- FZK: 1 CREAM with PBS (12 cores available through the queue 'pps')
- CNAF: WMS + LSF CEs
- PIC ?
- CERN : 1 CREAM pointing to the production queues (LSF) and WNs.
- GRNET: 1 CREAM with PBS (on a separate node) and 5 WNs. CREAM MySQL DB on a separate host.
Results
Phase1
Feedback from the experiments
Alice discussed the progress of the pilot in several task force meetings
General comments on installation procedure
Special configuration done in FZK to support Alice
SAM
SAM tests for CREAM are now available in the PPS SAM instance
Nagios
No progress with Nagios was recorded during phase1. The testing activity was transferred to phase2.
Phase2
Feedback from the experiments
The whole phase2 was characterised by a reduced interaction with the experiments with respect to phase1. The reasons for that were:
- the concurrent deployment of the first version of CREAM in production led Alice to work almost exclusively with production sites
- the bulky installations at PADOVA and CNAF were often reserved for stress testing
- delays at other sites in setting up comparable alternative testbeds.
Alice confirmed that the set of patches defined in GGUS:47489 is compatible with their intended use in production.
General comments on installation procedure
The installation and configuration of the products with YUM + YAIM is fully supported and of production quality.
SAM
Results of direct and WMS-based submission tests against the CREAM CEs in the pilot are now available on the SAM PPS portal (https://pps-sam.cern.ch:8443/sam/sam.py).
Nagios
As the SAM mechanism is now available and instances of CREAM are available in production, the responsibility for the corresponding development of Nagios was fully moved to OAT.
CREAM and ICE Development
Several performance and scalability issues affecting the ICE-->CREAM submission chain were identified and followed up. Most of them were solved by the set of patches identified by GGUS:47489.
The performance issue reported with BUG:47911 and identified as a showstopper for the full replacement of the lcg-CE is understood, and the solution is under testing.
Credits
This pilot has been conceived taking into account the plans of JRA1/SA3 developed within the cluster of competence and the SA3 guidelines for certification.
CREAM and ICE precertification
================================
Foreword
-------
The new certification model in the EGEE-III project foresees a
pre-certification phase done by the so-called "cluster of competence".
A cluster of competence in charge of the pre-certification of a certain
software component is composed of the JRA1 developers of that component
and of the SA3 people close (i.e. local) to them. The collaboration
of SA1 is also foreseen.
The pre-certification phase is then followed by a formal certification process
usually done by a partner different than the one in charge of the development
and of the pre-certification of that component.
This formal certification phase is supposed to be very quick, since most (all)
of the problems should have been found and addressed during the
pre-certification step.
So the Italian cluster of competence is in charge of the pre-certification
of CREAM and ICE.
Testbeds
--------
For the pre-certification of CREAM and ICE, we envisage the use of
2 testbeds:
- Testbed a: small testbed, supposed to be used by the cluster of competence
people, that is by the CREAM and ICE developers and by the SA3 Italian
people. This testbed is supposed to be used mainly for functionality tests
and for some limited stress tests.
- Testbed b: larger testbed, supposed to be used mainly by experiment people,
and in particular for scalability tests.
This testbed basically corresponds to the "experimental services" considered
so far for the WMS.
Hardware requirements for pre-certification testbeds
----------------------------------------------------
For testbed a:
1 UI node
1 WMS node
1 LB node
1 BDII comprising both CREAM and LCG based CEs (the LCG based CEs
are the production ones)
4 CREAM based CEs, possibly distributed in different sites
For testbed b:
2 WMS nodes
1 LB node
1 BDII comprising both CREAM and LCG based CEs (the LCG based CEs are
the production ones): this can be the same BDII used in testbed a
At least 20 CREAM based CEs, possibly distributed in different sites
As for the WNs, it is not necessary to have dedicated machines
for such testbeds: the same WNs used in production can be used.
It is just a matter of reserving a certain number of slots (e.g. 50) for the
queues dedicated to the CREAM pre-certification tests.
Configuration of the testbeds
-----------------------------
On both testbeds it is suggested to devote 2 queues per CREAM CE (so 2 CEIDs
per CREAM CE machine) to these tests.
As for the VOs to enable, on testbed b the "production" VOs should
be authorized. On testbed a the 21 "fake" VOs should also be enabled, so
that testers can perform tests submitting jobs on behalf of multiple users
belonging to different VOs.
Updates on the testbeds
-----------------------
Both testbeds are supposed to be updated (WMSes and/or CREAM CEs) whenever
a new blocking issue is found and/or whenever a certain number of new
fixes for non-blocking problems should be tested.
It is supposed that testbed a will be updated much more often than
testbed b.
Testbed b should be updated only after testbed a has been updated and
the updated version has been tested on testbed a by the cluster of
competence people.
Software deployed on the certification testbeds must be tagged
(tags will be done on the proper "pre-certification" CVS branches).
-------- Original Message --------
Subject: CREAM testing plan
Date: Fri, 23 May 2008 15:12:20 +0200
From: Oliver Keeble <oliver.keeble@cern.ch>
To: Markus Schulz <Markus.Schulz@cern.ch>, Francesco Giacomini <francesco.giacomini@cnaf.infn.it>, Di Qing <Di.Qing@cern.ch>
My summary of the plan;
Plan and criteria for CREAM certification
Two broad criteria
* scalability at the level previously defined in CE acceptance criteria
* functionality verified based on the CLI spec, direct verification of
the web service interface, and a set of CREAM tests functionally
equivalent to those currently run against the lcg-CE
Andreas to organise with Massimo a functional test plan, and to find
resources for writing the tests (should not take more than a week).
CERN will add extra resources to the test infrastructure currently used
to validate the lcg-CE, including a CREAM CE and a production-level
(non-ICE) WMS. CERN will run the tests, including the 5 day soak.
Comment - we are *not* certifying ICE, and would like to avoid
difficulties in interpreting test results if possible. Di will make a
judgement as to whether we can simply loop over the CREAM CLI tests we
will have in order to do the appropriate scalability validation. If so,
this is the approach we will take. If not, we will take the ICE rpms and
upgrade the testbed WMS. There is a question over testing
proxy-renewal if we don't use the WMS.
When CREAM is released to production, it will be advertised as being
made available for larger sites to install in parallel with their
existing lcg-CEs, not as a replacement for them. In this way we will
soon have a pool of CREAM CEs exposed to production work patterns and
loading, without endangering availability of resources.
After initial release, responsibility for CREAM scalability/stress
testing will pass to INFN and CERN certification will not invoke such tests.
History
25-Jun-08: SAM - duplication of SAM sensor slowed down by SAM unavailability
2-Jul-08: The decision is made to extend phase1 until the 22nd of July (see PPIslandFollowUp2008x07x01)
22-Jul-08: The decision is made to extend phase1 until the 26th of August (see PPIslandFollowUp2008x07x22)
2-Sep-08: The decision is made to extend phase1 until the 30th of September (see PPIslandFollowUp2008x09x02)
1-Oct-08: The decision is made to close phase1 and start phase2 (see PPIslandFollowUp2008x10x01)
28-Nov-08: New version of CREAM available for installation on the CREAM PPS pilot
09-Dec-08: Alice is starting a new stream of activity on the pilot at CNAF
09-Dec-08: Pilot end-date moved to end of January.
11-Dec-08: A request was sent to PPS site admins and the EGEE regional managers to join for an extension of the pilot
12-Dec-08: There is a new yaim-cream-ce (v. 4.0.7-2) in the YUM repo for the CREAM PPS pilot (PATCH:2667).
13-Jan-09: A new version of CREAM was released to the pilot. This version fixes BUG:45437 and BUG:45736.
13-Jan-09: Within the SA1 coordination meeting the SA1 ROCs were invited to use the pilot version of CREAM for their regional installation
13-Jan-09: Stress test of the ICE+CREAM submission chain: a submission rate of 40 jobs/min was sustained, but a failure rate higher than expected was observed. The issue is currently under analysis
13-Jan-09: Pilot end-date moved to mid-March.
20-Jan-09: Alice tested successfully the CLI using the CE at FZK.
03-Feb-09: PIC joined the pilot, setting up multiple CREAM CEs accessing the production queues
03-Feb-09: The high failure rate observed affecting the ICE+CREAM submission chain was analysed and the causes fixed. The system now correctly sustains a submission rate of 40 jobs/min. Stress tests with long-lasting jobs seem to show performance issues when the number of active jobs in the system increases.
03-Feb-09: Alice requests CERN to propose a timeline for the deployment of CREAM in production
25-Feb-09: Results of direct and WMS-based submission tests available on the SAM PPS portal (https://pps-sam.cern.ch:8443/sam/sam.py)
11-Mar-09: ICE-WMS cert-rb-01.cnaf.infn.it reconfigured in order to allow CMS to use the production instances of CREAM
12-Mar-09: CMS testing in progress against ice-WMS at CNAF
30-Mar-09: GGUS:47489 opened to track the release in production of a set of patches validated by the pilot
1-Apr-09: The decision is made to close phase2 and start phase3 (see PPIslandFollowUp2009x04x01)