PPS Pilot Follow-up Meeting Minutes Thu 19 Feb 2009

  • Date: Thu 19 Feb 2009
  • Agenda: 52981
  • Description: Pilot of glexec/SCAS: check-point
  • Chair: Antonio Retico

Attendance

  • PPS: Antonio Retico
  • gLite Integration Testing Release (EGEE-SA3): Gianni Pucciani
  • glexec/SCAS development (EGEE-JRA1): Oscar Koeero
  • Atlas: Maxim Potekhine, Jose Caballero, Massimo Lamanna
  • FZK: Angela Poschlad
  • CNAF (pilot repositories): Danilo Dongiovanni (tried unsuccessfully to connect)
  • LHCb: Absent

Review of action items (tasks)

SA1/SA3 tasks

Status of the subtasks of TASK:8986 (see them in the PPS tracker).

Other tasks


Notes:

The installation at FZK went through according to the schedule (in fact it was completed two days early).

Oscar gave an update on his action (delivery of the new version of glexec supporting error codes, with the relevant specification): a new version will be released to SA3 tomorrow (20-Feb-2009). This will be done by cloning the existing patch and submitting a new one.
Antonio: It is good to deliver according to the standard release process, but if the developers need to apply changes more frequently, we have agreed within other pilots on a different strategy in order to save time. Basically, a patch is created and left in status "With Provider", and the modifications delivered to the pilot are documented there incrementally as they are distributed to the pilot. In this way the changes are documented appropriately and at the same time the existing patch can follow its standard path through certification/preproduction. The certification team is involved in the pilot, so they can keep an eye on the content and raise exceptions earlier if needed. The standard process can of course still be followed; the choice is up to the developers. But the main goal of the pilot in this phase, for all parties concerned, is to support effectively the integration of the error codes into the experiment's submission framework.
Oscar: So I can provide just the differences with respect to what Angela has installed. That would consist only of a glexec rpm.
Antonio: So I propose that you create a new patch that contains only that rpm and give Angela the link and the documentation to upgrade.
Oscar: I will provide the documentation in the new patch.
Angela agrees with the procedure.
The agreed date for the delivery is 20-Feb-09 (afternoon).

Status and results of the pilot service (by VOs and sites)

Massimo: No activity related to this pilot to report for Atlas this week.
Maxim: Last week Jose worked on a new version of the pilot to comply with the way the DM data transfer is implemented in PanDA. This turned out to require a significant change in the code. We work with the glexec executable we get from the OSG distribution. So apparently the glexec code is evolving and we need a new version to be distributed in OSG to stay in sync.
Oscar: There is a significant difference in the way glexec is configured in Europe and in the US. Specifically, in the US glexec stays in the process list, whereas in the EU configuration the glexec process is replaced by the called process. This is a little difference that you may notice when using the European installation (FZK). Within an hour I will contact OSG and synchronise their version with what we have today.
Antonio/Massimo: Please consider that this is a pilot activity. The idea is to work with a newer version (not distributed yet in gLite) and try it out in the production environment before it is widely distributed. So the PanDA developers should interact with FZK to try the new error codes and then, if everything is correct, these can be distributed further. The deployment of new versions to OSG is independent of the pilot activity, and it would not make sense to do a full release cycle to OSG of something that, in theory, could need to be changed quite frequently.
Antonio: One thing that should be done soon by Atlas (we were expecting it to happen already this week) is to test the glexec call at FZK, just to see whether everything is set up correctly or there is something else to provide at the site. Can we expect this functional test of the old-style glexec functionality to be done at FZK within this week or the beginning of the next one?
Maxim: We can submit a job (not the complete Atlas pilot) to make sure that we invoke glexec correctly.
Oscar made a comment about the error codes: the four new codes, used to return groups of errors, do not overlap with the existing ones. So the submission frameworks' logic can either handle any kind of error code as it did before or, eventually, treat the new ones differently. In conclusion, I don't think that we have an incompatibility problem.
Antonio: Backward compatibility is an important point, but as the functionality has not been deployed in the EGEE grid production so far, it is not one of the main objectives of the pilot. We recognise though that backward compatibility can be important for OSG. The return of appropriate error codes is required for site operations reasons, if not by the applications.
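
Oscar's point about the non-overlapping codes can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the minutes do not list the actual code values, so the set below is a placeholder, and the handler names are hypothetical and not part of any submission framework.

```python
# Placeholder values: the minutes do not give the actual new glexec codes.
PLACEHOLDER_NEW_CODES = frozenset({201, 202, 203, 204})


def legacy_handler(exit_code: int) -> str:
    """Old-style logic: any non-zero exit code is simply a failure."""
    return "ok" if exit_code == 0 else "failed"


def updated_handler(exit_code: int, new_codes=PLACEHOLDER_NEW_CODES) -> str:
    """Treat the new group codes separately, fall back to the old logic otherwise."""
    if exit_code in new_codes:
        return "glexec-error"
    return legacy_handler(exit_code)


for code in (0, 1, 201):
    print(code, legacy_handler(code), updated_handler(code))
```

Because the new codes are disjoint from the existing ones, the legacy handler keeps reporting them as ordinary failures, while an updated framework can branch on them when it is ready to do so.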

Status and results of the development (by developers)

Oscar will run a stress test of the new glexec (with error codes) today and tomorrow morning on our testing facility and provide it to Gianni tomorrow (20-Feb).

Open Issues (by VOs, sites, deployment teams)

none

Update to planning

Antonio: In our first plan the ramp-up of production resources at FZK and IN2P3 was supposed to start on the 1st of March and to finish by the 6th. That would leave one week, the next one, for PanDA to apply and test the needed changes. Considering the interaction in progress, I'd say that it would make sense for Atlas to shift the ramp-up one week later.
Massimo: I think indeed that we should have at least a couple of iterations with the current installation at FZK before moving forward. We wouldn't learn much more by deploying at Lyon now.
Maxim: As there are major changes to be made to the PanDA pilot, however, re-scheduling the installation or not is irrelevant to us, provided that the pilots don't start failing.
Massimo: That would not happen, but starting an extended deployment of something that may change frequently would be expensive for the sites, which have other things to do as well.
Maxim: There is an OSG meeting at the beginning of March and then many of us will be heading to Europe. So something will be done in terms of coding, but there will not be the usual output.
Antonio: So Atlas agrees that it is fine to work with this set-up until the interaction is stable. There may be different requirements coming from LHCb, but as they haven't run any test on this installation yet, and so the activity is delayed on their side, I will reschedule the ramp-up to start one week later (the 6th), unless issues are found in the PanDA-glexec interaction before then.

AOB

1) Massimo asked Oscar for some comments about the results of the stress tests done by Gianni, especially about the memory leak. The results relate to the version currently in certification (and deployed at FZK).

Oscar: There was a lot of testing and we found and fixed problems in lcas, lcmaps and scas, but the main problem is in the Globus library that I am using for protocol handling. We reported the problem to Globus and we are following up, also trying to see how we can help them.

We have reduced the spikes of leakage. We had 6M invocations, an error rate of 0.2% and SCAS up and running for 6 days without any intervention. That's not bad for a system with such a big leak.

This is done with an internal re-start of a child process every 5 minutes (the re-start takes a few milliseconds). If this timeout could be stretched to a longer period of time, the error rate would definitely go down.

Massimo: in fact the error distribution shows that the errors are concentrated around the re-start

Oscar: The test was done with 1M to 6M invocations per day. This is a good test to point out the problem. However, in OSG we see that the rate imposed by PanDA is 100k-300k per day. So it is possible that, as the number of calls is lower in production, this error rate goes down.
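
To put the figures quoted above in proportion, here is a rough back-of-the-envelope calculation in Python. It rests on two assumptions that the minutes do not confirm: that the 6M invocations are the total over the 6-day run, and that the child process was re-started exactly every 5 minutes for the whole period. The function names are hypothetical.

```python
def expected_failures(invocations: int, errors_per_thousand: int) -> int:
    """Number of failed calls for a given error rate (integer arithmetic)."""
    return invocations * errors_per_thousand // 1000


def restarts_in(days: int, interval_minutes: int) -> int:
    """Number of child re-starts in a run of the given length."""
    return days * 24 * 60 // interval_minutes


failures = expected_failures(6_000_000, 2)  # 0.2% = 2 per thousand
restarts = restarts_in(6, 5)                # one re-start every 5 minutes
print(failures, restarts, round(failures / restarts, 1))
```

Under these assumptions the run saw about 12,000 failed calls and 1,728 re-starts. If every failure were tied to a re-start, that would be roughly 7 failed calls per re-start, which is an upper bound since the minutes only say the errors are concentrated around the re-starts.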

Massimo: Another question, about response time. From the graphs we see that the system mostly responds within one second, but sometimes the response time jumps to 6 seconds with nothing in between.

Oscar: This is the client doing a TCP and SSL handshake. If something fails on the client side when establishing the TCP connection, the client retries after 2s, 4s, 8s, 16s, ... I don't see why there is this concentration around six. I will investigate this.
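
The retry schedule Oscar describes is a doubling back-off. A minimal Python sketch of that schedule follows. It assumes that each delay is counted from the previous failed attempt and ignores the duration of the attempts themselves. The function name and its parameters are hypothetical and are not taken from the SCAS client.

```python
from itertools import accumulate


def backoff_delays(first_delay: float = 2.0, factor: float = 2.0,
                   retries: int = 4) -> list[float]:
    """Waiting time before each retry: 2s, 4s, 8s, 16s by default."""
    return [first_delay * factor ** i for i in range(retries)]


delays = backoff_delays()
total_wait = list(accumulate(delays))  # cumulative waiting time after each retry
print(delays)
print(total_wait)
```

Under this reading a call that only succeeds on its second retry has waited 2 + 4 = 6 seconds in total. Whether that is what produces the cluster seen in the graphs is not established in the minutes, and Oscar's investigation remains the open point.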

2) Antonio: During the last check point Pierre Girard was asking about the status of the load balancing. This is an important feature to set up because currently SCAS is a single point of failure for the whole site.

Oscar: I will address this problem with a new version of the SCAS client plugin, in which you can configure an unlimited number of end points. This could be ready by next week.

Massimo: How do you implement the load balancing? Are you multiplying the SCAS front-end only, or the back-end database as well?

Oscar: The front-end only. This is the same mechanism and code base used in the back-end of the current lcg-CE gatekeeper and the gridftp daemons.
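
The idea of a client configured with several end points can be sketched as follows in Python. This is not the SCAS client plugin, whose implementation is not described in the minutes. The function, the host names and the `query` callable are all hypothetical, and the sketch only shows the general pattern of trying each configured end point in turn.

```python
from typing import Callable, Sequence


def first_answer(endpoints: Sequence[str], query: Callable[[str], str],
                 start: int = 0) -> str:
    """Try each end point once, beginning at index `start`; return the first reply."""
    if not endpoints:
        raise ValueError("no end points configured")
    last_error = None
    for i in range(len(endpoints)):
        endpoint = endpoints[(start + i) % len(endpoints)]
        try:
            return query(endpoint)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and move to the next end point
    raise ConnectionError("all end points failed") from last_error


def fake_query(endpoint: str) -> str:
    """Stand-in for a real request: the first host is pretended to be down."""
    if endpoint == "scas1.example.org":
        raise ConnectionError("down")
    return f"decision from {endpoint}"


endpoints = ["scas1.example.org", "scas2.example.org"]
print(first_answer(endpoints, fake_query))
```

Varying `start` between calls spreads the load over the front-ends, while the loop provides the failover that removes the single point of failure.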

3) This morning the sysadmins at NIKHEF announced that glexec is installed on all the production WNs and that the backend is configured to use exactly the same gridmap directory as the gatekeeper.

Antonio: So we have to welcome NIKHEF as a de facto supporter of this pilot.

Oscar: Unofficially yes. LHCb is already configured and their pilot jobs account is white-listed.

4) Gianni: Oscar, as you are leaving at the end of March, is there anyone taking over the development?

Oscar: David and Michel have enough clearance to work with cvs and etics. At this moment there are four people covering for my application. I will also distribute the mailing list that we use for support.

Next check point will be on the 26th of March and chaired by Gianni.


Topic revision: r3 - 2009-02-21 - AntonioRetico
 