PPS Pilot Follow-up Meeting Minutes Tue 03 Feb 2009

  • Date: Tue 03 Feb 2009
  • Agenda: 49840
  • Description: Pilot of Cream CE: check-point
  • Chair: Antonio Retico

Attendance

  • PPS: Antonio Retico

  • PIC: Raquel Munoz, Christian Neissner
  • FZK: Angela Poschlad
  • SCAI: Klare Kassirer
  • CNAF: Daniele Cesini
  • PADOVA: Sara Bertocco

  • CMS: Andrea Sciaba'
  • Alice: Patricia Mendez
  • LHCb: ...

  • JRA1/Cream/WMS: Massimo Sgaravatto
  • SA3: Alessio Gianelle

Review of action items (tasks)

Status of the subtasks of TASK:7981 (see them in the PPS tracker).

#7981: Set-up and run Cream CE Pilot (Phase2), 45 days left

TASK:7994: Run UI @ PPS-CNAF, 45 days left

TASK:7427: Configure FCR in PPS to handle CREAM-CE, 85 days behind

No news reported

TASK:7144: Define the layout of future SAM tests for Cream CEs, 85 days behind

Development of Cream tests for Nagios is in progress at CERN

TASK:7157: Verify behaviour of CreamCE in nagios, 92 days behind

Development of Cream tests for Nagios is in progress at CERN

TASK:7992: Run Cream@FZK-PPS, 45 days left

TASK:7991: Run WMS + ICE @ CNAF, 45 days left

TASK:7990: Run Cream@INFN-CNAF, 45 days left

TASK:7989: Run Cream@INFN-PADOVA, 1 day left

TASK:7988: Run Cream@INFN-BARI, 45 days left

TASK:8144: remove pilot-specific queues from the PPS BDII, 101 days left

Status and results of the pilot service (by VOs and sites)

- Involvement of PIC in the pilot

Antonio: welcome to PIC as they have now joined the pilot.
PIC has agreed to run a set-up very similar to the one in PADOVA (multiple CEs on top of a single batch system accessing the production queue).
They are currently running two CEs published in their production BDII. As far as I know this set-up was useful to Enzo Miccio (CMS) to run some of his tests in Padova.

Andrea: the idea would be to have them in the production BDII with a GlueCEStateStatus different from "Production". Enzo was probably using our test suite, but this is not equivalent to having the set-up tested by CRAB.

Christian: we have already installed two CREAM CEs. We would prefer to publish the production queues in the PPS BDII.

Antonio: this is the set-up that we suggested: all the WNs in the back-end offer the standard production set-up (AFS and CMS software area available); the only difference would be the BDII used by the pilot WMS. Is it possible to configure CRAB to use that WMS?

Andrea: as CRAB is configured using the same UI configuration file, it is in theory possible to configure CRAB in such a way. What is the WMS to be used?

Antonio: The one at CNAF, documented in the pilot main page in the layout for phase2.

Andrea: a technical detail. In order to submit with CRAB the production site Storage Element needs to be configured as "close" to the computing element because the matchmaking is done starting from the storage element.
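Andrea's point about matchmaking can be illustrated with a minimal Python sketch. It assumes a simplified model in which each CE advertises a list of close SEs and a job is matched only to CEs that are close to the SE holding its data. The host names and the `match_ces` helper are hypothetical, and this is not the actual WMS or CRAB logic.

```python
# Simplified, hypothetical model of data-driven matchmaking:
# start from the SE holding the data, then keep only the CEs
# that declare that SE as "close".

def match_ces(ces, data_se):
    """Return the IDs of CEs that list `data_se` among their close SEs."""
    return [ce["id"] for ce in ces if data_se in ce["close_ses"]]

# Hypothetical host names, for illustration only.
ces = [
    {"id": "cream-01.example.org", "close_ses": ["se-prod.example.org"]},
    {"id": "cream-02.example.org", "close_ses": []},
]

matched = match_ces(ces, "se-prod.example.org")
print(matched)
```

In this model a CREAM CE with no close SE declared is never selected, which is why the production SE has to be configured as close to the pilot CEs.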

Christian: At PIC there's no problem because we have only the production storage to work with

Daniele: At CNAF this can be a problem, we have to investigate with the T1 admins

Antonio: consider that the VOs are requesting it and it's their own storage at stake, so it shouldn't be a concern. The idea of the pilot is to build an isolated environment where the users know exactly what they are doing. Interactions with the production system in this context are allowed.

Andrea: as long as the computing element is not published in production with the "Production" status it shouldn't be accidentally matched, so having an SE configured as close SE for that node shouldn't be a risk. This use of the attribute was agreed long ago.

Daniele: what if the configuration is accidentally changed and the CREAM goes to "Production" by mistake?

Massimo: there is now a parameter in YAIM to manage this value and if the variable is not set the default value used is "Special". That should protect the production system against accidental re-configurations.
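The safeguard Massimo describes can be sketched as follows. This is a hypothetical Python model, not YAIM code: the `ce_state` key and both helper functions are made up for illustration, and the only facts taken from the discussion are that an unset value falls back to "Special" and that only CEs published as "Production" are meant to be matched by production submissions.

```python
# Hypothetical model of the default-state safeguard (not actual YAIM code).

DEFAULT_CE_STATE = "Special"  # default reported in the meeting

def published_state(site_config):
    """State the CE would publish; `ce_state` is a made-up key name."""
    return site_config.get("ce_state") or DEFAULT_CE_STATE

def eligible_for_production(site_config):
    """Simplified rule: only an explicit "Production" state is matched."""
    return published_state(site_config) == "Production"

print(published_state({}))                                   # variable not set
print(eligible_for_production({}))                           # reconfigured without the variable
print(eligible_for_production({"ce_state": "Production"}))   # explicitly set
```

Under these assumptions, a re-configuration that drops the variable leaves the CE in the "Special" state, so it would take an explicit setting to expose it to production matchmaking.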

Antonio: considering also that the CREAM service deployed should support only Alice, CMS and OPS, what is the risk in declaring the production SE as close SE of CREAM?

Daniele and Christian: In that case the set-up should be protected enough.

Antonio: to answer Christian's comment, we go on with the current set-up at PIC and let CMS try and use it. Modifications to the set-up will be done if requested.

Christian: is it requested to have more than 2 CEs per site?

Antonio: this is not a production use case of course, but it would be useful as the focus of this phase of the pilot is on the ICE submission engine. Therefore if you have the possibility to instantiate a bigger number of CREAM CEs (possibly on virtual machines) that could be good.

Christian: if you define the number we could do it.

Antonio will create the new tasks with PIC and agree with Raquel about the effort to put there.


Antonio reported on the CREAM CEs installed by the EGEE regions (data gathered during the SA1 coordination meeting)

  • Italy update: starting today (3-Feb) the installation for Alice at the T1. Actively working in PPS pilot at INFN-PADOVA.
  • APROC update: installation re-scheduled at ASGC, expected by the end of this week (6-Feb)
  • SEE update: in Greece, progress but no news, will be ready by next week (9/13-Feb)
  • Nordic: working according to plan (which, on 13-Jan, was at least 1 month for KTH --> 20-Feb)
  • Benelux update: No updates received
  • UKI update: Installation of Cream in production finished at RAL. Setting up the gridftp server and finding some more storage on our local Alice VOBox as Alice need these to be able to use Cream
  • SWE: CREAM installed at PIC-PPS (with access to production queues). The installation is now being debugged by the developers
  • FR: Planning to have it in March but site admin is now off sick for 2 months, so unsure.
  • CE: scheduled installation, will be ready in 2 weeks (by 20-Feb)
  • RUSSIA: CREAM was installed at RU-Protvino-IHEP. SAM tests available.

Antonio asked Daniele for an update about the installation for Alice in progress at the T1

Daniele knows that there is an installation of gridftp server and VOBOX in progress at the T1 and thinks that they are going to do it this week.

Antonio: it would be good if they could use the pilot version. Alice repeated the test with the current version of Cream and it performed well. So it would be great if you could lobby for the installation of the pilot version of Cream.


Reports from the VOs

Alice

Patricia: Each time a new site became available we tested it. For the time being we have used three sites: FZK, one site in Russia (Protvino?) and Kolkata. Now the tests are completely stopped and we are waiting for CNAF. It is important for us to know when CREAM is going to be provided at CERN.

Antonio will ask and report.

CMS

Antonio: in the last meeting we agreed that CMS should wait for the high failure rate issue to be resolved. Can we let them start the testing now?

Massimo: the high failure rate was tracked with three different bugs. We fixed these bugs and we did another test submitting a collection of 40 jobs/min for 5 days. The jobs were 5 minutes long (enough to test the proxy renewal).

(Statistics taken from a previous message from Massimo)

Collections correctly submitted: 7180 (287200 jobs)

  • DONE OK: 284838 (99.18%)
  • ABORTED: 0 (0.0%)
  • Not finished: 2362 (0.82%)
  • Resubmissions: 4599 (1.60%)
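The figures above can be cross-checked with a few lines of Python. The sketch assumes that each collection contained 40 jobs (consistent with 287200 / 7180) and that all percentages are relative to the total number of jobs; the variable names are illustrative only.

```python
# Cross-check of the reported test statistics.
collections = 7180
jobs_per_collection = 40          # assumption: 40 jobs per collection
total_jobs = collections * jobs_per_collection

counts = {"DONE OK": 284838, "ABORTED": 0, "Not finished": 2362}
resubmissions = 4599

# The three final states should account for every submitted job.
states_sum = sum(counts.values())

# Percentages relative to the total number of jobs, rounded to 2 decimals.
percentages = {k: round(100 * v / total_jobs, 2) for k, v in counts.items()}
resub_pct = round(100 * resubmissions / total_jobs, 2)

# One collection per minute for 5 days would give 5 * 24 * 60 = 7200 collections,
# close to the 7180 that were correctly submitted.
minutes_in_5_days = 5 * 24 * 60

print(total_jobs, states_sum, percentages, resub_pct, minutes_in_5_days)
```

The totals and percentages computed this way agree with the reported values.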

Two causes of resubmission were identified: failures of the RLS daemon of LSF, and crashes in the BLAH BL parser.

The fixes for the relevant bugs are tracked in PATCH:2448

After this fix we ran another test with much longer jobs submitted by 5 different users. This new test brought us several issues with the proxy renewal daemon running on the WMS nodes. We haven't really understood the cause, and apparently the issue disappears when a different myproxy is used. In general, the fact that there are many more jobs active in the system between WMS and CEs seems to lead to some performance issues, which we are trying to understand.

Antonio: so, considering all these issues, does it make sense for CMS to start, or do you prefer to spend more work on further research?

Massimo: it makes sense for CMS to start but they should use a different set of CEs than PADOVA and CNAF, where there could be conflicts with our tests. There are no conflicts on the WMS at CNAF but we need some updates on that machine (to be done by Daniele)

Andrea: can we have a list of the machines where we are supposed to run?

Antonio: What CMS should expect to see are the CREAM CEs in FZK, PIC, and the part of CNAF served by the pbs batch system. A list of the relevant machines will be provided. We start with the current configuration and then we evolve according to CMS needs. The overall objective is to let CMS stress ICE a bit.

Andrea: CMS will wait for a green light before starting

Status and results of the development (by developers)

Covered in the CMS section

Open Issues (by VOs, sites, deployment teams)

Antonio reported that BARI appears to be inactive in the tasks

Massimo confirms that currently the interactions with BARI are very rare and limited to consulting about Torque issues

The decision is made to release BARI from the pilot activity for the time being. Relevant changes on the tasks will be done.

List of Open bugs and relevant decisions

Recommendations for release and deployment

Decision about termination/extension of the pilot

Next check point meeting scheduled for Wednesday 18th February at 10 AM

AOB


Topic revision: r3 - 2009-02-04 - AntonioRetico
 