Cream CE Pilot Service: Description and Status.


  • Start Date: 10 Jun 2008
  • End Date (tentative): 17 Mar 2009
  • Description: Pilot Service of Cream CE
  • Coordinator: Antonio Retico
  • Contact e-mail: egee-pps-pilot-cream@cern.ch
  • Status : In progress

Description

JRA1, SA3 and SA1 are organising a pilot service focused on the new Cream CE in order to collect feedback from the experiments and to accelerate the testing and deployment in production of the new service.

The pilot will be organised in two phases:

  • 1st phase: Some of the PPS sites will be gradually requested to replace their lcg-CE with CREAM. We will start with one site, published in the PPS BDII, and then extend the testbed as needed. The aim of this phase is to fine-tune the installation tools (YAIM and release notes) and to verify that the new services interact correctly with the monitoring tools. In addition, one WMS in PPS will need to be adapted to submit to CREAM CEs. Initially, therefore, up to two PPS sites will be needed to support this scenario, growing later to a few more (ideally one per batch system).

  • 2nd phase: to start as soon as the installation is stable and the service has been demonstrated to work and interact correctly with the other components. Some production sites will be asked to add/replace one or more CREAM CEs, to be published with GlueServiceStatus = 'production'. The LHC experiments will be involved in this phase to start a controlled submission of production jobs to the new service.

It is important to point out that this activity is by no means meant to replace the standard certification of the service. The certification will be carried out in parallel in the usual way and in close synergy with the pilot, so that ideally both environments will profit from each other's findings. The site administrators involved will have

  • to react promptly to possible issues found
  • to keep in touch with JRA1 people
  • to apply the fixes they provide
  • to communicate and keep track of them

Overall Planning

Phase1

  • Initial plan for phase 1:
    creampilotph1.gif

  • Initial roadmap for the test in PPS:
    1. Set-up of Cream CE on torque at PPS-CNAF (eventually replacing cert-ce-03.cnaf.infn.it) and FZK-PPS (timeline: 1 week)
    2. Enabling ICE at SCAI-PPS (or FZK-PPS as back-up) (timeline: 1 week, in parallel with 1)
    3. Verification/fixing of SAM monitoring chain in PPS (the SAM client at PPS-RAL should switch to use the Cream-enabled WMS) (timeline: 1.5 weeks, starting from first successful job submission)
    4. Extension of the tests to other supported batch systems/platforms. IN2P3-CC-PPS, PIC for LSF and possibly some other sites could get involved here; to be confirmed.
    5. Getting ready for phase 2 (PIC?, CNAF?, IN2P3?)
      The named CE machines will be exempted from applying the standard PPS updates for the whole duration of the pilot. The WMS, instead, will need special care because the non-standard extra configuration will have to be maintained across possible future service upgrades.

Phase2

  • Phase2 - Initial Planning:
    Cream-phase2-initial-planning-081002.gif

Phase2 started on 1 October 2008. It focuses on the performance of the ICE-enabled WMS.

The objective of Phase2 is to enable CMS users to submit continuously at a rate of 10,000 jobs/day over 5 weeks.
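As a rough sanity check of this target, the sketch below converts the daily figure into an average per-minute rate and a total over the five weeks. Only the two input numbers come from the text; the function and constant names are illustrative, and the output is plain arithmetic rather than a measured result.

```python
# Back-of-the-envelope check of the Phase2 submission target.
# Only JOBS_PER_DAY and WEEKS come from the text; names are illustrative.
JOBS_PER_DAY = 10_000
WEEKS = 5

MINUTES_PER_DAY = 24 * 60


def jobs_per_minute(jobs_per_day):
    """Average submission rate needed to reach a daily target."""
    return jobs_per_day / MINUTES_PER_DAY


def total_jobs(jobs_per_day, weeks):
    """Total number of jobs if the daily target is met every day."""
    return jobs_per_day * weeks * 7


if __name__ == "__main__":
    print(f"average rate: {jobs_per_minute(JOBS_PER_DAY):.2f} jobs/min")
    print(f"total over {WEEKS} weeks: {total_jobs(JOBS_PER_DAY, WEEKS)} jobs")
```

The average works out to roughly 7 jobs per minute, which gives a scale for comparing any sustained rate measured in stress tests.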

The proposed layout for the pilot is

  • 2 WMSs: CNAF and (FZK or SCAI)
  • 1 UI CNAF: cert-ui-01.cnaf.infn.it
  • 1 BDII CNAF: including services in pilot + LCG production
  • 1 VOBOX(if needed): FZK
  • CREAM CEs: FZK (Angela Poschlad), Padova (14 CEs: 7 PBS, 7 LSF; resp. Sara Bertocco), Bari (Giacinto Donvito), CNAF (~10 CEs; resp. Daniele Cesini), SCAI (Klare Cassirer?)
    The CREAM CEs will access production batch systems

Requirements for the software to be deployed in the pilot:

  1. CVS tag defined and yum repository (at CNAF) available
  2. SW documented in patches in state >= "With provider"
  3. SW accompanied by a "certification of deployability" delivered by SA3 Italy (Alessio Gianelle)

Methodology:

  • Updates of the SW at sites, if needed, will be applied by SA1 personnel on demand of the developers
  • Patches "approved" by the pilot will be set to "ready for certification"

Description of the patches to be expected (provided off-line by Massimo):

  1. new version of ICE to install over a WMS with PATCH:1841
  2. new version of yaim-wms for possible modifications in configuration
  3. a patch of CREAM/CEMon/BLAH (rpms glite-ce* to be installed on CREAM CE)
  4. a patch of yaim-cream-ce
  5. a patch of CREAM-CLI

Patches 3, 4 and 5 could be released earlier to the standard certification track to fix possible issues in the CREAM version now in production.

  • Initial roadmap:
    1. Prepare formal tasks for SA1 participants (timeline: ~3 days)
    2. Integrate three new sites in the pilot framework (timeline: ~4 days, in parallel with 1)
    3. Prepare patches (timeline: ~4 days, in parallel with 1)
    4. Upgrade of the sites in the pilot (timeline: ~1 week)
    5. Open to users

Technical documentation

Installation of Cream CE

To install a CREAM CE please follow the instructions reported at:

http://igrelease.forge.cnaf.infn.it/doku.php?id=doc:guides:devel:install-cream31-devel

Some remarks:

  • In the "YUM repository setup" section of the instructions, three possible options are listed. Please use the third one (for the CREAM PPS pilot related activities).
  • Please check for other CREAM-related repo files in /etc/yum.repos.d (e.g. left over from older installations) and remove them if present.
  • The instructions sometimes prescribe different actions depending on whether you are using RPMs v. 1.8.x (patch #1755) or RPMs later than v. 1.8.x. The pilot is in the latter case (i.e. RPMs later than v. 1.8.x).
  • Please publish TestbedB as the value of the GlueCEStateStatus attribute. As explained in Appendix A, this is done via the variable CREAM_CE_STATE, to be set in <your-siteinfo.def-dir>/services/glite-creamce

Please then publish your CREAM CEs in a dedicated CREAM site-bdii (you can host this CREAM site-bdii on one of the CREAM CEs). Please choose a new (i.e. non-existing) basedn for this CREAM site-bdii.

Installation of ICE WMS (Phase2)

To install an ICE-enabled WMS, please follow the usual instructions for the installation of "glite-WMS".

However there are some differences to consider with respect to the official guide:

  1. The middleware repository (see "The middleware repositories" section): besides the production one, please also copy the pilot repository file into /etc/yum.repos.d (see "yum Repositories for Phase2")
  2. Since the ICE RPM is not yet in the glite-WMS metapackage used in production, please also install it with: yum install glite-wms-ice
  3. Modify the ICE section in /opt/glite/etc/glite_wms.conf as specified below, and then restart ICE (/opt/glite/etc/init.d/glite-wms-ice restart)

 ICE =  [
    job_cancellation_threshold_time   =   300;
    max_logfile_size   =   200*1024*1024;
    cream_url_prefix   =   "https://";
    start_subscription_updater   =   true;
    max_ice_mem  =  4000000;
    subscription_update_threshold_time   =   3600;
    ice_log_level   =   700;
    listener_enable_authn   =   true;
    listener_enable_authz   =   true;
    poller_status_threshold_time   =   30*60;
    lease_update_frequency   =   20*60;
    ice_host_key   =   "/home/glite/.certs/hostkey.pem";
    lease_delta_time   =   0;
    cream_url_postfix   =   "/ce-cream/services/CREAM2";
    subscription_duration   =   86400;
    InputType   =   "filelist";
    start_lease_updater   =   false;
    notification_frequency   =   3*60;
    creamdelegation_url_postfix   =   "/ce-cream/services/gridsite-delegation";
    ice_host_cert   =   "/home/glite/.certs/hostcert.pem";
    cemon_url_postfix   =   "/ce-monitor/services/CEMonitor";
    ice_topic   =   "CREAM_JOBS";
    start_proxy_renewer   =   true;
    Input   =   "/var/glite/ice/ice_fl";
    max_logfile_rotations   =   100;
    ice_empty_threshold  =  600;
    start_poller   =   true;
    listener_port   =   7010;
    poller_delay   =   2*60;
    creamdelegation_url_prefix   =   "https://";
    purge_jobs   =   false;
    start_listener   =   true;
    proxy_renewal_frequency   =   600;
    bulk_query_size   =   100;
    soap_timeout   =   60;
    persist_dir   =   "/var/glite/ice/persist_dir";
    log_on_file  =  true;
    logfile   =   "/var/log/glite/ice.log";
    max_ice_threads   =   20;
    start_job_killer   =   false;
    log_on_console  =  false;
    cemon_url_prefix   =   "https://";
    ];
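Several values in the ICE section above are written as products (for example 200*1024*1024 or 30*60). The sketch below is a minimal reader for blocks of this shape that expands those products, which can help when reviewing a modified section. It is not the parser used by the WMS: it only understands quoted strings, true/false, integers and integer products, and all function names are illustrative.

```python
import re
from math import prod

# Matches "name = value;" lines; "ICE = [" and "];" do not match and are skipped.
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*(.+?);\s*$")


def parse_value(raw):
    """Convert one right-hand side into a Python value."""
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == raw[-1] == '"':
        return raw[1:-1]                      # quoted string
    if raw in ("true", "false"):
        return raw == "true"                  # boolean
    # integer, or a product of integers such as 30*60
    return prod(int(factor) for factor in raw.split("*"))


def parse_ice_section(text):
    """Return a dict mapping attribute names to expanded values."""
    values = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            values[match.group(1)] = parse_value(match.group(2))
    return values


# A few lines copied from the section above, used as sample input.
SAMPLE = """
 ICE =  [
    max_logfile_size   =   200*1024*1024;
    poller_status_threshold_time   =   30*60;
    lease_update_frequency   =   20*60;
    cream_url_prefix   =   "https://";
    start_poller   =   true;
    ];
"""

if __name__ == "__main__":
    for name, value in parse_ice_section(SAMPLE).items():
        print(f"{name} = {value!r}")
```

For the sample lines, max_logfile_size expands to 209715200 and poller_status_threshold_time to 1800. The units of these attributes are not stated in the section above, so the sketch does not attach any.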

yum Repositories for Phase2

Here are the instructions to use them:

SLC4-native-compiled i386 glite-CREAM_ce 3.1:

It is a yum repository:

  • create the file /etc/yum.repos.d/glite-CREAM_ce_i386.repo containing the following lines:
[glite-CREAM_ce_i386]
name= glite-CREAM_ce i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-CREAM_CE/sl4/i386/
enabled=1
  • run: yum update

SLC4-native-compiled i386 ICE enabled glite-WMS_ICE 3.1:

It is a yum repository:
  • create the file /etc/yum.repos.d/glite-WMS_ICE_i386.repo containing the following lines:
[glite-WMS_ICE_i386]
name= glite-WMS_ICE i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-WMS_ICE/sl4/i386/
enabled=1
  • run: yum update

SLC4-native-compiled i386 ICE enabled glite-LB 3.1:

It is a yum repository:

  • create the file /etc/yum.repos.d/glite-LB_i386.repo containing the following lines:
[glite-LB_i386]
name= glite-LB i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-LB/sl4/i386/
enabled=1
  • run: yum update
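The three repository stanzas above differ only in the section id and in one path component of the baseurl. The sketch below renders them from a small table, which may be convenient when preparing several nodes. The section ids and URLs are copied from the stanzas above; naming each file after its section id and the helper names are assumptions of this sketch. It only prints the content and does not write anything under /etc.

```python
# Common prefix of the three pilot repositories listed above.
REPO_ROOT = "http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE"

# (section id, path component under REPO_ROOT), taken from the stanzas above.
REPOS = [
    ("glite-CREAM_ce_i386", "glite-CREAM_CE"),
    ("glite-WMS_ICE_i386", "glite-WMS_ICE"),
    ("glite-LB_i386", "glite-LB"),
]


def repo_path(section_id):
    """File name for a stanza (assumption: one file per section id)."""
    return f"/etc/yum.repos.d/{section_id}.repo"


def render_repo(section_id, component):
    """Return the text of one .repo stanza."""
    name = section_id.replace("_i386", " i386")
    return (
        f"[{section_id}]\n"
        f"name={name}\n"
        f"baseurl={REPO_ROOT}/{component}/sl4/i386/\n"
        "enabled=1\n"
    )


if __name__ == "__main__":
    for section_id, component in REPOS:
        print(f"# {repo_path(section_id)}")
        print(render_repo(section_id, component))
```

Only the stanza that matches the node type (CREAM CE, ICE-enabled WMS or LB) needs to be installed on a given machine before running yum update.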

Pilot Layout

Phase1 (closed in September)

  • Cream CEs:
-CNAF: cert-ce-03.cnaf.infn.it + 4 virtual WNs using pbs, 7 queues (alice, atlas, cms, lhcb, ops, dteam, pps)
-FZK: pps-cream-fzk.gridka.de

  • ICE WMS:
-FZK: pps-rb-fzk.gridka.de
-SCAI: glite-wms2.scai.fraunhofer.de

  • Available CLIs:
-CNAF: cert-ui-01.cnaf.infn.it
-FZK: pps-vobox-fzk.gridka.de (alice prod setup)

Phase2

UI

WMS + ICE

  • cert-rb-01.cnaf.infn.it (CMS-CRAB test)
  • glite-wms2.scai.fraunhofer.de

Cream CEs

LSF Cream

INFN-PADOVA

  • cream-10 SL4, batch master LSF
  • cream-21 SL4, CE with LSF
  • cream-22 SL4, CE with LSF
  • cream-23 SL4, CE with LSF
  • cream-24 SL4, CE with LSF
  • cream-25 SL4, CE with LSF
  • cream-26 SL4, CE with LSF
  • cream-27 SL4, CE with LSF - site BDII
  • prod-wn-001 SL4, WN with LSF
  • prod-wn-002 SL4, WN with LSF
  • prod-wn-003 SL4, WN with LSF
  • prod-wn-004 SL4, WN with LSF
  • prod-wn-005 SL4, WN with LSF

PBS Cream

INFN-PADOVA

  • cream-28 SL4, CE with pbs - batch master
  • cream-29 SL4, CE with pbs
  • cream-30 SL4, CE with pbs
  • cream-31 SL4, CE with pbs
  • cream-32 SL4, CE with pbs
  • cream-33 SL4, CE with pbs
  • cream-34 SL4, CE with pbs - site BDII
  • prod-wn-006 SL4, WN with PBS
  • prod-wn-007 SL4, WN with PBS
  • prod-wn-008 SL4, WN with PBS
  • prod-wn-009 SL4, WN with PBS
  • prod-wn-010 SL4, WN with PBS

CNAF LSF CREAM CE CLUSTER

  • cert-04.cnaf.infn.it SL4 LSF CE
  • cert-05.cnaf.infn.it SL4 LSF CE
  • cert-06.cnaf.infn.it SL4 LSF CE
  • cert-07.cnaf.infn.it SL4 LSF CE
  • cert-08.cnaf.infn.it SL4 LSF CE
  • cert-09.cnaf.infn.it SL4 LSF CE
  • cert-13.cnaf.infn.it SL4 LSF CE

CNAF PBS CREAM CE CLUSTER

  • cert-ce-03.cnaf.infn.it SL4 PBS CE (CMS-CRAB test)

Site BDII:

  • INFN-PADOVA-CREAMTEST ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid
  • INFN-PADOVA-CREAMTEST-PBS ldap://cream-34.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST-PBS,o=grid
  • INFN-BARI-CREAMTEST ldap://cream-ce-1.ba.infn.it:2170/mds-vo-name=INFN-BARI-CREAMTEST,o=grid
  • INFN-CNAF-CREAM ldap://cert-07.cnaf.infn.it:2170/mds-vo-name=INFN-CNAF-CREAM,o=grid
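Each site BDII entry above combines a host, a port and a base DN in a single LDAP URL. The sketch below splits such a URL and builds the argument list for an OpenLDAP ldapsearch query against it. The helper name is illustrative, and no search filter is added, so the query would return everything published under the base DN.

```python
import shlex
from urllib.parse import urlsplit


def ldapsearch_args(bdii_url):
    """Build an ldapsearch argument list from a site BDII LDAP URL."""
    parts = urlsplit(bdii_url)
    if parts.scheme != "ldap" or not parts.hostname:
        raise ValueError(f"not an ldap:// URL: {bdii_url!r}")
    port = parts.port or 389                 # default LDAP port if none is given
    base_dn = parts.path.lstrip("/")         # e.g. mds-vo-name=...,o=grid
    return [
        "ldapsearch",
        "-x",                                # simple (anonymous) bind
        "-LLL",                              # plain LDIF output without comments
        "-H", f"ldap://{parts.hostname}:{port}",
        "-b", base_dn,
    ]


if __name__ == "__main__":
    url = "ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid"
    # Print a command line that can be pasted into a shell on a UI.
    print(shlex.join(ldapsearch_args(url)))
```

The printed command can be run from any machine with the OpenLDAP client tools to check that a site BDII answers and publishes its CREAM CEs.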

Results

Phase1

Feedback from the experiments

Alice discussed the progress of the pilot in several task force meetings

General comments on installation procedure

Special configuration done in FZK to support Alice

SAM

SAM tests for CREAM are now available in the PPS SAM instance

Nagios

No progress with Nagios was recorded during Phase1. The testing activity has been transferred to Phase2.

Phase2

Credits

This pilot has been conceived taking into account the plans of JRA1/SA3 developed within the cluster of competence and the SA3 guidelines for certification

CREAM and ICE precertification
================================

Foreword
--------
The new certification model in the EGEE-III project foresees a
pre-certification phase done by the so-called "cluster of competence".
The cluster of competence in charge of the pre-certification of a given
software component is composed of the JRA1 developers of that component
and of the SA3 people close (i.e. local) to them. The collaboration
of SA1 is also foreseen.
The pre-certification phase is then followed by a formal certification process,
usually done by a partner other than the one in charge of the development
and of the pre-certification of that component.
This formal certification phase is supposed to be very quick, since most (if not all)
of the problems should have been found and addressed during the
pre-certification step.

The Italian cluster of competence is therefore in charge of the pre-certification
of CREAM and ICE.


Testbeds
--------
For the pre-certification of CREAM and ICE, we envisage the use of
2 testbeds:

- Testbed a: small testbed, supposed to be used by the cluster of competence
  people, that is by the CREAM and ICE developers and by the SA3 Italian
  people. This testbed is supposed to be used mainly for functionality tests
  and for some limited stress tests.


- Testbed b: larger testbed, supposed to be used mainly by experiment people,
  and in particular for scalability tests.
  This testbed basically corresponds to the "experimental services" considered
  so far for the WMS


Hardware requirements for pre-certification testbeds
----------------------------------------------------
For testbed a:
    1 UI node
    1 WMS node
    1 LB node
    1 BDII comprising both CREAM and LCG based CEs (the LCG based CEs
           are the production ones)
    4 CREAM based CEs, possibly distributed in different sites

For testbed b:
    2 WMS nodes
    1 LB node
    1 BDII comprising both CREAM and LCG based CEs (the LCG based CEs are
           the production ones): this can be the same BDII used in testbed a
    At least 20 CREAM based CEs, possibly distributed in different sites

As for the WNs, dedicated machines are not necessary for these
testbeds: the same WNs used in production can be used.
It is just a matter of reserving a certain number of slots (e.g. 50) for the
queues dedicated to the CREAM pre-certification tests.


Configuration of the testbeds
-----------------------------
On both testbeds it is suggested to devote 2 queues per CREAM CE (so 2 CEIDs
per CREAM CE machine) to these tests.

As for the VOs to enable, on testbed b the "production" VOs should
be authorized. On testbed a the 21 "fake" VOs should also be enabled, so
that testers can perform tests submitting jobs on behalf of multiple users
belonging to different VOs.

Updates on the testbeds
-----------------------
Both testbeds are supposed to be updated (WMSes and/or CREAM CEs) whenever
a new blocking issue is found and/or whenever a certain number of new
fixes for non-blocking problems should be tested.
It is supposed that testbed a will be updated much more often than
testbed b.
Testbed b should be updated only after testbed a has been updated
and the updated version has been tested on testbed a by the cluster
of competence people.

Software deployed on the certification testbeds must be tagged
(tags will be done on the proper "pre-certification" CVS branches)


-------- Original Message --------
Subject: CREAM testing plan
Date: Fri, 23 May 2008 15:12:20 +0200
From: Oliver Keeble <oliver.keeble@cern.ch>
To: Markus Schulz <Markus.Schulz@cern.ch>,  Francesco Giacomini <francesco.giacomini@cnaf.infn.it>, Di Qing <Di.Qing@cern.ch>


My summary of the plan;

Plan and criteria for CREAM certification

Two broad criteria

   * scalability at the level previously defined in CE acceptance criteria
   * functionality verified based on the CLI spec, direct verification of
the web service interface, and a set of CREAM tests functionally
equivalent to those currently run against the lcg-CE

Andreas to organise with Massimo a functional test plan, and to find
resources for writing the tests (should not take more than a week).

CERN will add extra resources to the test infrastructure currently used
to validate the lcg-CE, including a CREAM CE and a production-level
(non-ICE) WMS. CERN will run the tests, including the 5 day soak.

Comment - we are *not* certifying ICE, and would like to avoid
difficulties in interpreting test results if possible. Di will make a
judgement as to whether we can simply loop over the CREAM CLI tests we
will have in order to do the appropriate scalability validation. If so,
this is the approach we will take. If not, we will take the ICE rpms and
upgrade the testbed WMS. There is a question over testing
proxy-renewal if we don't use the WMS.

When CREAM is released to production, it will be advertised as being
made available for larger sites to install in parallel with their
existing lcg-CEs, not as a replacement for them. In this way we will
soon have a pool of CREAM CEs exposed to production work patterns and
loading, without endangering availability of resources.

After initial release, responsibility for CREAM scalability/stress
testing will pass to INFN and CERN certification will not invoke such tests.

History

25-Jun-08: SAM - duplication of SAM sensor slowed down by SAM unavailability

2-Jul-08: The decision is made to extend phase1 until the 22nd of July (see PPIslandFollowUp2008x07x01)

22-Jul-08: The decision is made to extend phase1 until the 26th of August (see PPIslandFollowUp2008x07x22)

2-Sep-08: The decision is made to extend phase1 until the 30th of September (see PPIslandFollowUp2008x09x02)

1-Oct-08: The decision is made to close phase1 and start phase2 (see PPIslandFollowUp2008x10x01)

28-Nov-08: New version of CREAM available for the installation on the CREAM PPS pilot

09-Dec-08: Alice is starting a new stream of activity on the pilot at CNAF

09-Dec-08: Pilot end-date moved to end of January.

11-Dec-08: A request was sent to PPS site admins and the EGEE regional managers to join for an extension of the pilot

12-Dec-08: There is a new yaim-cream-ce (v. 4.0.7-2) in the YUM repo for the CREAM PPS pilot (PATCH:2667).

13-Jan-09: A new version of CREAM was released to the pilot. This version fixes BUG:45437 and BUG:45736.

13-Jan-09: At the SA1 coordination meeting, the SA1 ROCs were invited to use the pilot version of CREAM for their regional installations

13-Jan-09: Stress test of the ICE+CREAM submission chain: a submission rate of 40 jobs/min was sustained, but a failure rate higher than expected was observed. The issue is currently under analysis

13-Jan-09: Pilot end-date moved to mid-March.

20-Jan-09: Alice tested successfully the CLI using the CE at FZK.

03-Feb-09: PIC joined the pilot, setting up multiple CREAM CEs accessing the production queues

03-Feb-09: The high failure rate observed in the ICE+CREAM submission chain was analysed and its causes fixed. The system now correctly sustains a submission rate of 40 jobs/min. Stress tests with long-lasting jobs seem to show performance issues when the number of active jobs in the system increases.

03-Feb-09: Alice requests CERN to propose a timeline for the deployment of CREAM in production

Topic attachments
  • Cream-phase2-initial-planning-081002.gif (r1, 11.0 K, 2008-10-03 11:37, AntonioRetico): Phase2 - Initial Planning
  • creampilotph1.gif (r1, 6.4 K, 2008-06-26 01:04, AntonioRetico): Initial plan for phase 1
Topic revision: r38 - 2009-03-02 - AntonioRetico
 