CREAM CE Pilot Service: Description and Status

Description

JRA1, SA3 and SA1 are organising a pilot service focused on the new CREAM CE, in order to collect feedback from the experiments and to accelerate the testing and production deployment of the new service.

The pilot will be organised in two phases:

  • 1st phase: some PPS sites will gradually be asked to replace their lcg-CE with a CREAM CE. We will start with one site, published in the PPS BDII, and then extend the testbed as needed. The aim of this phase is to fine-tune the installation tools (YAIM and release notes) and to verify that the new services interact correctly with the monitoring tools. In addition, one WMS in PPS will need to be adapted to submit to CREAM CEs. Initially, therefore, up to two PPS sites will be needed to support this scenario, growing to a few more later (ideally one per batch system).

  • 2nd phase: to start as soon as the installation is stable and the service has been shown to work and to interact correctly with the other components. Some production sites will be asked to add one or more CREAM CEs (or to replace existing CEs with them), to be published with GlueServiceStatus = 'production'. The LHC experiments will be involved in this phase and will start a controlled submission of production jobs to the new service.

It is important to point out that this activity is by no means meant to replace the standard certification of the service. The certification will be carried out in parallel in the usual way and in close synergy with the pilot, so that ideally each environment will profit from the findings of the other. The site administrators involved will have:

  • to react promptly to possible issues found
  • to keep in touch with JRA1 people
  • to apply the fixes they provide
  • to communicate and keep track of them

Overall Planning

Phase1

  • Initial plan for phase 1:
    creampilotph1.gif

  • Initial roadmap for the test in PPS:
    1. Set-up of CREAM CE on Torque at PPS-CNAF (possibly replacing cert-ce-03.cnaf.infn.it) and FZK-PPS (timeline: 1 week)
    2. Enabling ICE at SCAI-PPS (or FZK-PPS as back-up) (timeline: 1 week, in parallel with 1)
    3. Verification/fixing of the SAM monitoring chain in PPS (the SAM client at PPS-RAL should switch to the CREAM-enabled WMS) (timeline: 1.5 weeks, starting from the first successful job submission)
    4. Extension of the tests to other supported batch systems/platforms. I think that IN2P3-CC-PPS, PIC (for LSF) and possibly some other sites could get involved here; to be confirmed.
    5. Getting ready for phase 2 (PIC?, CNAF?, IN2P3?)
      The named CE machines will be exempted from applying the standard PPS updates for the whole duration of the pilot. The WMS, instead, will need special care, because its non-standard extra configuration will have to be preserved across any future service upgrades.

  • Update 25/6
    1. SAM: duplication of the SAM sensor slowed down by SAM unavailability

Phase2

Technical documentation

Installation of Cream CE

We have received a set of installation instructions which we consider sufficient for an initial set-up (http://igrelease.forge.cnaf.infn.it/doku.php?id=doc:guides:devel:install-cream31-devel). A simple set of instructions to enable ICE on the WMS is also available: it basically consists of installing a few rpms and modifying glite-wms.conf. We therefore think that phase 1 can be declared open.

Enabling ICE on WMS

Start from an official, working, certified WMS, i.e. one installed from this repository:

http://grid-deployment.web.cern.ch/grid-deployment/glite/cert/3.1/glite-WMS/sl4/$basearch


  • Install log4cpp-0.3.4b-1.slc4

  • Add this new yum repository:
[wms-ice-enabled]
name=wms-ice-enabled
baseurl=http://devel12.cnaf.infn.it:7444/repository/wms-ice-enabled/
enabled=1

  • Then issue:

yum clean all

yum update


This should update:

Updated: glite-security-proxyrenewal.i386 1.3.5-1.slc4
Updated: glite-wms-common.i386 3.1.20-1.slc4
Updated: glite-wms-purger.i386 3.1.10-3.slc4

  • Issue:

yum install glite-wms-ice


This should install:

Installed: glite-ce-cream-client-api-c.i386 1.8.4-2.slc4
Installed: glite-ce-monitor-client-api-c.i386 1.8.0-6.slc4
Installed: glite-wms-ice.i386 3.1.27-1.slc4
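To check the result of the two yum steps against the package lists given here, the "Updated:"/"Installed:" lines can be compared automatically. What follows is only a minimal Python sketch, assuming the yum output has been saved as text and that each relevant line looks like the ones shown above (tolerating an optional "epoch:" prefix on the version is an extra assumption); parse_yum_output and missing_or_wrong are hypothetical helper names, not part of gLite.

```python
import re

# Expected name -> version-release pairs, copied from the lists above.
EXPECTED = {
    "glite-security-proxyrenewal": "1.3.5-1.slc4",
    "glite-wms-common": "3.1.20-1.slc4",
    "glite-wms-purger": "3.1.10-3.slc4",
    "glite-ce-cream-client-api-c": "1.8.4-2.slc4",
    "glite-ce-monitor-client-api-c": "1.8.0-6.slc4",
    "glite-wms-ice": "3.1.27-1.slc4",
}

# Matches lines such as "Updated: glite-wms-common.i386 3.1.20-1.slc4".
# The optional "N:" epoch prefix is an assumption about other yum versions.
LINE_RE = re.compile(r"^(?:Updated|Installed):\s+(\S+)\.(\w+)\s+(?:\d+:)?(\S+)$")


def parse_yum_output(text):
    """Return {package name: version-release} for Updated/Installed lines."""
    found = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, _arch, version = match.groups()
            found[name] = version
    return found


def missing_or_wrong(text, expected=EXPECTED):
    """Sorted names of expected packages that are absent or at another version."""
    found = parse_yum_output(text)
    return sorted(name for name, version in expected.items()
                  if found.get(name) != version)
```

An empty list from missing_or_wrong means every package in the two lists above was reported at the expected version.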

In the WMS configuration file (/opt/glite/etc/glite-wms.conf), set these parameters:

 
ICE =  [
    start_listener  =  true;
    start_lease_updater  =  true;
    logfile  =  "${GLITE_LOCATION_LOG}/ice.log";
    log_on_file = true;
    creamdelegation_url_prefix  =  "https://";
    listener_enable_authz  =  true;
    poller_status_threshold_time  =  30*60;
    ice_topic  =  "CREAM_JOBS";
    subscription_update_threshold_time  =  3600;
    lease_delta_time  =  2*60*60;
    notification_frequency  =  3*60;
    start_proxy_renewer  =  true;
    max_logfile_size  =  200*1024*1024;
    ice_host_cert  =  "/home/glite/.certs/hostcert.pem";
    Input  =  "${GLITE_LOCATION_VAR}/ice/ice_fl";
    job_cancellation_threshold_time  =  300;
    poller_delay  =  2*60;
    persist_dir  =  "${GLITE_LOCATION_VAR}/ice/persist_dir";
    lease_update_frequency  =  20*60;
    log_on_console = false;
    cream_url_postfix  =  "/ce-cream/services/CREAM2";
    subscription_duration  =  86400;
    bulk_query_size  =  100;
    purge_jobs  =  true;
    InputType  =  "filelist";
    listener_port  =  7010;
    listener_enable_authn  =  true;
    ice_host_key  =  "/home/glite/.certs/hostkey.pem";
    start_poller  =  true;
    creamdelegation_url_postfix  =  "/ce-cream/services/gridsite-delegation";
    cream_url_prefix  =  "https://";
    max_ice_threads  =  10;
    cemon_url_prefix  =  "https://";
    start_subscription_updater  =  true;
    proxy_renewal_frequency  =  600;
    ice_log_level  =  700;
    soap_timeout  =  60;
    start_job_killer  =  true;
    max_logfile_rotations  =  20;
    cemon_url_postfix  =  "/ce-monitor/services/CEMonitor";
    max_ice_mem = 4000000;
    ice_empty_threshold = 600;
    ];
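Several values above are written as products (e.g. lease_delta_time = 2*60*60). The Python sketch below evaluates a subset of them into readable units, assuming (as the factors 60 and 1024 suggest, although this page does not say so) that the time-like settings are in seconds and max_logfile_size is in bytes. The helper names are hypothetical.

```python
import re

# Subset of the numeric ICE settings, copied verbatim from glite-wms.conf above.
ICE_SNIPPET = """
    poller_status_threshold_time  =  30*60;
    subscription_update_threshold_time  =  3600;
    lease_delta_time  =  2*60*60;
    notification_frequency  =  3*60;
    max_logfile_size  =  200*1024*1024;
    poller_delay  =  2*60;
    lease_update_frequency  =  20*60;
    subscription_duration  =  86400;
    proxy_renewal_frequency  =  600;
"""

# "name = <integers joined by *>;" ; quoted strings and booleans do not match.
SETTING_RE = re.compile(r"^\s*(\w+)\s*=\s*([\d*\s]+);\s*$")


def parse_int_settings(text):
    """Return {name: int} for settings whose value is a product of integers."""
    values = {}
    for line in text.splitlines():
        match = SETTING_RE.match(line)
        if match:
            name, expr = match.groups()
            product = 1
            for factor in expr.split("*"):
                product *= int(factor)  # int() ignores surrounding spaces
            values[name] = product
    return values


def human_seconds(seconds):
    """Render a duration given in seconds as 'Hh Mm Ss'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for name, value in parse_int_settings(ICE_SNIPPET).items():
        if name == "max_logfile_size":  # assumed to be bytes
            print(f"{name}: {value // (1024 * 1024)} MiB")
        else:  # assumed to be seconds
            print(f"{name}: {human_seconds(value)}")
```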

Then use /opt/glite/etc/init.d/glite-wms-ice [start/stop/status] to run the ICE service.

Pilot Layout

Phase1

  • Cream CEs:
-CNAF: cert-ce-03.cnaf.infn.it
-FZK: pps-cream-fzk.gridka.de

  • ICE WMS:
-FZK: pps-rb-fzk.gridka.de
-SCAI: glite-wms2.scai.fraunhofer.de

  • Available CLIs:
-CNAF: cert-ui-01.cnaf.infn.it
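For connectivity or firewall checks from the ICE WMS towards the phase 1 CREAM CEs, it can help to spell out the endpoints implied by the *_url_prefix/*_url_postfix settings in glite-wms.conf. This Python sketch assumes each URL is built as prefix + "host:port" + postfix and that CREAM listens on port 8443; neither assumption is stated on this page, and service_urls is a hypothetical helper.

```python
# Prefix/postfix pairs copied from the ICE section of glite-wms.conf.
URL_PARTS = {
    "cream": ("https://", "/ce-cream/services/CREAM2"),
    "delegation": ("https://", "/ce-cream/services/gridsite-delegation"),
    "cemon": ("https://", "/ce-monitor/services/CEMonitor"),
}

# Phase 1 CREAM CEs listed above.
PHASE1_CREAM_CES = ["cert-ce-03.cnaf.infn.it", "pps-cream-fzk.gridka.de"]

ASSUMED_PORT = 8443  # assumption: not stated on this page


def service_urls(host, port=ASSUMED_PORT):
    """Return {service: url} for one CREAM CE host (hypothetical helper)."""
    return {
        service: f"{prefix}{host}:{port}{postfix}"
        for service, (prefix, postfix) in URL_PARTS.items()
    }


if __name__ == "__main__":
    for ce in PHASE1_CREAM_CES:
        for service, url in service_urls(ce).items():
            print(f"{ce} {service}: {url}")
```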

Phase2

Results

Phase1

General comments on the installation procedure

SAM

Nagios

Phase2

Credits

This pilot has been conceived taking into account the JRA1/SA3 plans developed within the cluster of competence and the SA3 guidelines for certification.

CREAM and ICE precertification
================================

Foreword
--------
The new certification model in the EGEE-III project foresees a
pre-certification phase done by the so-called "cluster of competence".
A cluster of competence in charge of the pre-certification of a certain
software component is composed of the JRA1 developers of that component
and of the SA3 people close (i.e. local) to them. The collaboration
of SA1 is also foreseen.
The pre-certification phase is then followed by a formal certification process,
usually done by a partner different from the one in charge of the development
and of the pre-certification of that component.
This formal certification phase is supposed to be very quick, since most (ideally
all) of the problems should have been found and addressed during the
pre-certification step.

The Italian cluster of competence is therefore in charge of the pre-certification
of CREAM and ICE.


Testbeds
--------
For the pre-certification of CREAM and ICE, we envisage the use of 
2 testbeds:

- Testbed a: small testbed, supposed to be used by the cluster of competence
  people, that is by the CREAM and ICE developers and by the SA3 Italian 
  people. This testbed is supposed to be used mainly for functionality tests
  and for some limited stress tests.


- Testbed b: larger testbed, supposed to be used mainly by experiment people,
  and in particular for scalability tests.
  This testbed basically corresponds to the "experimental services" considered
  so far for the WMS.


Hardware requirements for pre-certification testbeds
----------------------------------------------------
For testbed a:
    1 UI node
    1 WMS node
    1 LB node
    1 BDII comprising both CREAM and LCG based CEs (the LCG based CEs
      are the production ones)
    4 CREAM based CEs, possibly distributed in different sites

For testbed b:
    2 WMS nodes
    1 LB node
    1 BDII comprising both CREAM and LCG based CEs (the LCG based CEs are
      the production ones): this can be the same BDII used in testbed a
    At least 20 CREAM based CEs, possibly distributed in different sites

As for the WNs, dedicated machines are not needed for these testbeds:
the same WNs used in production can be used.
It is just a matter of reserving a certain number of slots (e.g. 50) for the
queues dedicated to the CREAM pre-certification tests.


Configuration of the testbeds
-----------------------------
On both testbeds we suggest devoting 2 queues per CREAM CE (so 2 CEIDs
per CREAM CE machine) to these tests.
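With 2 queues per CREAM CE, testbed a (4 CREAM CEs) publishes 8 CEIDs and testbed b (at least 20 CREAM CEs) at least 40. The Python sketch below enumerates them, assuming the CEID form "<host>:<port>/cream-<batch system>-<queue>"; the port, batch system name, host names and queue names are all hypothetical placeholders.

```python
QUEUES_PER_CE = 2  # from the suggestion above


def ceids(hosts, batch_system="pbs", queues=("precert1", "precert2"), port=8443):
    """List CEIDs of the assumed form <host>:<port>/cream-<batch system>-<queue>.

    Defaults for batch_system, queues and port are hypothetical.
    """
    if len(queues) != QUEUES_PER_CE:
        raise ValueError(f"expected {QUEUES_PER_CE} queues per CREAM CE")
    return [f"{host}:{port}/cream-{batch_system}-{queue}"
            for host in hosts for queue in queues]


def min_ceids(n_cream_ces):
    """Number of CEIDs published when every CREAM CE has QUEUES_PER_CE queues."""
    return n_cream_ces * QUEUES_PER_CE


if __name__ == "__main__":
    print("testbed a:", min_ceids(4), "CEIDs")
    print("testbed b: at least", min_ceids(20), "CEIDs")
    for ceid in ceids(["cream-ce-01.example.org", "cream-ce-02.example.org"]):
        print(ceid)
```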

As for the VOs to enable, on testbed b the "production" VOs should
be authorized. On testbed a the 21 "fake" VOs should also be enabled, so
that testers can submit jobs on behalf of multiple users
belonging to different VOs.

Updates on the testbeds
-----------------------
Both testbeds are supposed to be updated (WMSes and/or CREAM CEs) whenever
a new blocking issue is found and/or whenever a certain number of new
fixes for non-blocking problems need to be tested.
Testbed a is expected to be updated much more often than
testbed b.
Testbed b should be updated only after testbed a has been updated
and the cluster of competence people have tested the updated version
on testbed a.
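The ordering rule for updates amounts to a simple gate: a tag may go to testbed b only once it has been deployed and tested on testbed a. A minimal Python sketch of that rule follows; the Testbed model, its methods and the tag names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Testbed:
    """Minimal, hypothetical model of a pre-certification testbed."""
    name: str
    deployed_tag: Optional[str] = None
    # Tags that passed the cluster of competence tests on this testbed.
    validated_tags: Set[str] = field(default_factory=set)

    def deploy(self, tag):
        self.deployed_tag = tag

    def mark_validated(self):
        """Record that the currently deployed tag passed the tests."""
        if self.deployed_tag is None:
            raise ValueError(f"nothing deployed on {self.name}")
        self.validated_tags.add(self.deployed_tag)


def promote_to_b(tag, testbed_a, testbed_b):
    """Update testbed b to `tag`, refusing tags not yet tested on testbed a."""
    if tag not in testbed_a.validated_tags:
        raise ValueError(f"{tag} has not been tested on {testbed_a.name} yet")
    testbed_b.deploy(tag)
```

Since validation can only be recorded for a tag that is actually deployed, "tested on testbed a" implies "deployed on testbed a first", which is the order required above.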

Software deployed on the certification testbeds must be tagged
(tags will be made on the appropriate "pre-certification" CVS branches).


-------- Original Message --------
Subject: CREAM testing plan
Date: Fri, 23 May 2008 15:12:20 +0200
From: Oliver Keeble <oliver.keeble@cern.ch>
To: Markus Schulz <Markus.Schulz@cern.ch>,  Francesco Giacomini <francesco.giacomini@cnaf.infn.it>, Di Qing <Di.Qing@cern.ch>


My summary of the plan:

Plan and criteria for CREAM certification

Two broad criteria

   * scalability at the level previously defined in CE acceptance criteria
   * functionality verified based on the CLI spec, direct verification of
the web service interface, and a set of CREAM tests functionally
equivalent to those currently run against the lcg-CE

Andreas to organise with Massimo a functional test plan, and to find
resources for writing the tests (should not take more than a week).

CERN will add extra resources to the test infrastructure currently used
to validate the lcg-CE, including a CREAM CE and a production-level
(non-ICE) WMS. CERN will run the tests, including the 5 day soak.

Comment - we are *not* certifying ICE, and would like to avoid
difficulties in interpreting test results if possible. Di will make a
judgement as to whether we can simply loop over the CREAM CLI tests we
will have in order to do the appropriate scalability validation. If so,
this is the approach we will take. If not, we will take the ICE rpms and
upgrade the testbed WMS. There is a question over testing
proxy renewal if we don't use the WMS.

When CREAM is released to production, it will be advertised as being
made available for larger sites to install in parallel with their
existing lcg-CEs, not as a replacement for them. In this way we will
soon have a pool of CREAM CEs exposed to production work patterns and
loading, without endangering availability of resources.

After initial release, responsibility for CREAM scalability/stress
testing will pass to INFN and CERN certification will not invoke such tests.
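The option discussed in the message, looping over the CREAM CLI tests for scalability validation, could look roughly like the Python sketch below. It only counts exit codes and timings; the command is left to the caller, the CREAM CLI invocation quoted in the comment is an assumption to be checked against the CLI documentation, and the demo uses a harmless stand-in command.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def loop_cli_test(command: Sequence[str], iterations: int,
                  timeout_s: float = 300.0, pause_s: float = 0.0):
    """Run `command` `iterations` times; return (successes, failures, durations).

    A non-zero exit status or a timeout counts as a failure.
    """
    successes = failures = 0
    durations: List[float] = []
    for _ in range(iterations):
        start = time.monotonic()
        try:
            result = subprocess.run(command, stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL, timeout=timeout_s)
            ok = result.returncode == 0
        except subprocess.TimeoutExpired:
            ok = False
        durations.append(time.monotonic() - start)
        if ok:
            successes += 1
        else:
            failures += 1
        if pause_s:
            time.sleep(pause_s)
    return successes, failures, durations


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere. A real run would pass a
    # CREAM CLI test, e.g. (assumed syntax, placeholders in angle brackets):
    #   ["glite-ce-job-submit", "-a", "-r", "<CEID>", "<test.jdl>"]
    ok, ko, times = loop_cli_test([sys.executable, "-c", "pass"], iterations=5)
    print(f"{ok} ok, {ko} failed, slowest {max(times):.2f}s")
```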

Topic attachments

  • creampilotph1.gif (GIF, 6.4 K, r1, uploaded 2008-06-26 01:04 by AntonioRetico): Initial plan for phase 1
Topic revision: r10 - 2008-07-02 - ClemensKoerdt
 
Copyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.