PPS Pilot Follow-up Meeting Minutes Thu 28 May 2009

  • Date: Thu 28 May 2009
  • Agenda: 59823
  • Description: Pilot of glexec/SCAS: check-point
  • Chair: Antonio Retico

Attendance

  • PPS: Antonio Retico
  • SA3/Certification: Apologies
  • JRA1/Development: Oscar Koeroo; Mischa Salle
  • EGEE Operations: Nick Thackray
  • Atlas: Jose Caballero
  • LHCb: Stuart Paterson
  • IN2P3: Pierre Girard
  • Lancaster: Peter Love
  • FZK: Apologies

Review of action items (tasks)

SA1/SA3 tasks

Status of the subtasks of TASK:8986 (see them in the PPS tracker).

other tasks


Notes:

not covered

Status and results of the pilot service (by VOs and sites)

Atlas

Jose: We have been running tests against Atlas. The installation is almost perfect, with the exception of glexec rejecting the regenerated proxy if it contains a VOMS attribute that has expired in the meantime. We are waiting to repeat the tests at FZK when Angela is back, in order to understand whether there is a configuration issue at Lancaster or something in the release. BTW, the test worked at GridKa the first time and is now failing, so we suspect a regression of the bug https://savannah.cern.ch/bugs/?41472 which was supposed to be fixed.

Peter confirms that they are waiting for the rpm list from Angela and asks Atlas what the status of the pilot will be for STEP09: will Atlas be ready to use it? It would be useful to submit pilots with glexec during STEP09. Jose replied that Atlas' plan (discussed with Paul Nillsson and Massimo) is to adapt the Atlas framework to use glexec. The forecast is to be ready by next week.

Additional comment: there is a workaround for the issue seen in Lancaster (expired VOMS attribute), based on an environment variable that has to be set on both the WNs and the UI. We are in control of the environment on the WN but not on the UI, so if this workaround needs to be applied, a modification to the UI environment is needed.

Antonio: For that it is very important to open a bug to track the issue (or maybe to reopen BUG:41472?) describing the issue and the workaround. Jose will do it (action created).

LHCb

Stuart: Roberto has tested IN2P3 with trivial tests. There were a lot of configuration problems. We would like to know when we can restart after the configuration changes.

Pierre: The set-up is done, so the tests can start from Tuesday. You have to consider, however, that due to air cooling problems we do not have our full capacity but just one third of it, so load tests should be avoided.

Antonio: It would be a good idea to start testing at Lancaster as well, as they are now ready.

Stuart: The other point I want to raise is the issue with the environment, reported on many occasions. To us it seems natural to have the gridenv set up locally. We can cache the environment, but it seems just another dirty trick. We don't need the full pilot environment, but we would like to have the gridenv.

Antonio: I'll let Mischa comment on that. What I would like to understand is how difficult it would be for Dirac to interact with glexec as it is now.

Stuart: It is not difficult. We just think it's not clean.

Mischa: There are several possible workarounds and solutions (login shell, white-listing certain variables), and at the moment we have trouble understanding which one is best, because all of them introduce security risks.

Stuart: We tested it at Nikhef and we didn't have this problem (we ended up with grid environment problem). We would like to have uniform behaviour wherever it is deployed.

Mischa: That's interesting. It shouldn't have happened. We need to check.

Stuart: In conclusion, we can work around it, but we want to raise the point.

Antonio: You are very welcome to raise it, and the best way to do it is to open a bug. Ultimately our goal is not to discuss functional specifications or requirements but to decide whether this particular version is suitable or not for deployment in production, and the bugs are the documents we will use for the decision.

Jose: Atlas has a solution for managing the environment, and perhaps it could be generalised at the level of the middleware.

Antonio: Your experience is valuable and you are welcome to post your suggestion in the bug that Stuart will open. My impression is that the decision whether to fix this issue centrally or leave it to the experiments' frameworks will depend a lot on the cost-benefit ratio. It will probably have to be made within the TMB.
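
Illustration (not part of the discussion): the "white-listing certain variables" option mentioned by Mischa can be pictured with the minimal Python sketch below. It is not how glexec behaves or is configured; the function name and the choice of allowed variables and prefixes are assumptions made purely for illustration.

```python
# Hypothetical white-list. Which variables make up the "gridenv" is an
# assumption here; a real list would have to be agreed with the developers.
ALLOWED_NAMES = {"PATH", "HOME", "LANG"}
ALLOWED_PREFIXES = ("GLITE_", "LCG_")


def filter_environment(env, names=ALLOWED_NAMES, prefixes=ALLOWED_PREFIXES):
    """Return a copy of env that keeps only white-listed variables.

    A variable is kept if its name is listed explicitly or if it starts
    with one of the allowed prefixes; everything else is dropped.
    """
    return {
        key: value
        for key, value in env.items()
        if key in names or key.startswith(prefixes)
    }


if __name__ == "__main__":
    pilot_env = {
        "PATH": "/usr/bin",
        "GLITE_LOCATION": "/opt/glite",
        "LD_PRELOAD": "/tmp/untrusted.so",  # dropped: not on the white-list
    }
    print(filter_environment(pilot_env))
```

The sketch also shows where the security concern comes from: every name or prefix added to the list is one more value that the pilot's environment can pass on to the payload.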

Lancaster

Antonio: Question: apart from the issue related to the VOMS attribute, is your site in general working now?

Peter: We are not using SCAS because the client plugins are not available, and in general the installation was tricky. For example, if we had to support LHCb now, there would be a lot of work to do.

Antonio: In the pilot wiki page we collect information about the layout (PpsPilotSCAS#Pilot_Layout) and the installation issues (PpsPilotSCAS#Comments_and_issues_from_operati) experienced at the different sites. This is usually very valuable information for the other sites and for the editors of the release notes. Can you please record your configuration notes in PpsPilotSCAS?

Peter: I will do it.

IN2P3

Pierre: glexec is deployed on the production cluster and is available to LHCb only for now, but we can open it to Atlas if necessary.

I installed a second SCAS server, but for the time being I want to test with one first. We are ready for tests from LHCb and Atlas.

Because of the air cooling problems only 30% of the nodes are available and 40 disk servers are down, so this week we cannot sustain load tests. It would be good, however, to see the full glexec/SCAS chain at work. I'll be there to help with the tests. For LHCb the configuration is done, so they can start when they want. For Atlas I only need to configure authentication and authorization and something on the queues, but it is not a lot of work.

Jose: I can try. The only thing for Atlas to test is that we need the myproxy client on the WN.

Pierre: We have to check. I did it, but I am not sure it's working. We can let your test discover it.

Antonio: If it can help you, the myproxy client is being propagated to the WN now. We received it in PPS recently, so a tarball with myproxy may be available somewhere in the release line. I'll let you know.

Everybody agrees on Tuesday 2nd of June as the date to start the tests at Lyon.

Status and results of the development (by developers)

Oscar: We see from the last MB meeting that all kinds of very active discussions are in progress. Although Atlas has a very sound solution for using glexec, there are still many open questions from both VOs and sysadmins. We are willing to organise a workshop at NIKHEF to address both kinds of users in a face-to-face meeting, in order to bootstrap the deployment. This could be done, if required, even at short notice.

Antonio: I don't know exactly which body should make the decision to have a workshop, probably the GDB. A face-to-face meeting is always useful. Actually, putting the developers in contact with sysadmins and end-users was one of the drivers of this pre-production activity. I agree that a face-to-face workshop would be even more direct, and it is welcome. I would like to point out, however, that especially in the last two weeks a lot of discussion has happened through mail threads (sometimes not even posted to our mailing list). A lot of valuable technical information has been exchanged but probably lost in this volatile communication. We can do better in this respect if we start discussing technical issues in bugs. Ultimately, as I said, the decision to deploy this version or not will be made on open bugs. My personal impression is that this version of glexec is at least not disruptive, so in my opinion it would be advisable to allow this version into the production repository and let other sites try it, as John Gordon is asking the sites to do in the GDB.

Nick: The next GDB is on the 10th. If we manage to open an exhaustive list of bugs, we could analyse and discuss it within the GDB and take there the decisions about the future of this version and about whether or not to have a workshop.

Open Issues (by VOs, sites, deployment teams)

List of Open bugs and relevant decisions

Recommendations for release and deployment

none

Decision about termination/extension of the pilot

We will have another check-point on the 9th at 16:00 CEST (Antonio and Nick may have a conflict with the SA1 Coordination f2f meeting).

AOB



Topic revision: r2 - 2009-05-30 - AntonioRetico
 