PP Kick-Off Meeting Minutes Thu 05 Feb 2009

  • Date: Thu 05 Feb 2009
  • Agenda: 52054
  • Description: Pilot Service of glexec/SCAS
  • Chair: Antonio Retico

Attendance

  • PPS/SA1: Antonio Retico
  • FZK: Angela Poschlad
  • IN2P3: Pierre Girard
  • CMS: Andrea Sciaba'
  • Atlas: Massimo Lamanna, Maarten Litmaath
  • LHCb: Roberto Santinelli
  • JRA1: Oscar Koeroo
  • SA3: Gianni Pucciani
  • Claudio Grandi

Service Introduction (by PPS)


Quick summary of the events previous to this meeting

The gLite release team informed us that they consider the new SCAS service sufficiently stable for a pilot service to be set up. In particular, the most severe issues found earlier (memory leaks, bad configuration) have been solved. In parallel, we contacted the LHC experiments (specifically CMS, Atlas and LHCb) to discuss the activity, and they were in favour of a controlled deployment in production of a pilot service based on some instances of SCAS. Specifically, LHCb would like a supporting T1 to be involved in the pilot, and suggested IN2P3 and/or FZK as first choices. Both sites replied that they were interested, and so they were invited.


Information from certification

The patches in question are

Gianni: The software is currently undergoing stress testing in certification.

Antonio explained the purpose of the meeting: to find the best possible layout to allow the experiments to do preliminary testing of the glexec/SCAS chain, based on the functionality currently provided by the patch in certification. With respect to the functionality, he pointed out that the preliminary analysis identified an issue: the current version of glexec does not discriminate between errors due to non-authorised users and other types of errors (e.g. non-availability of the SCAS back-end). He thinks that this is a major issue that should be addressed before the software is deployed in production. He suggests as an alternative a more restricted set-up based on a PPS site (published in the PPS BDII).

LHCb and Atlas were informed of these concerns before the meeting and they produced the following replies.

LHCb

LHCb has an internal tracking mechanism, so glexec is perceived as a complication. However, they recognise glexec as a constraint imposed in order to run on the middleware, and they will comply with the glexec decision.

Despite the undifferentiated return codes, LHCb strongly wants the pilot to be run in production. This is the best way to find out what the unexpected problems/complications may be.

LHCb will always interpret a negative reply from glexec as meaning that the user cannot run the payload on that site, regardless of whether this negative reply was caused by technical issues. LHCb's approach in that case will be to schedule another job on the same site with a different user.

LHCb is considering the possibility of setting up a SAM sensor to extensively test the glexec/SCAS functionality.

A discussion followed about the feasibility of the approach suggested by LHCb of using SAM jobs to monitor the correct behaviour of SCAS. Maarten pointed out that the basic functionality of the service can be tested using the pilot certificate and one more user certificate, but that this wouldn't be an exhaustive test for the VO, because that would require the use of all the users' certificates. Perhaps a slightly richer test case can be obtained with the same certificate if the mapping is done differently for different FQANs.

Oscar: If the FQAN is significantly different (e.g. role "production" vs role "operator") they will actually be mapped to different accounts. At Nikhef we are also working on the monitoring infrastructure and thinking of Nagios-based test sensors.

Antonio: the Nagios sensor is something of general utility, so as soon as progress is made it should be integrated in the pilot. The development and configuration of monitoring and operational tools is normally in the scope of PPS pilots. This option will be considered later on, when progress has been made with the sensors.

Antonio comments on LHCb's points: The approach LHCb plans to use to deal with glexec failures is correct. I just want to be sure that you are aware that operating the glexec service in production as it is now could lead to a significant reduction of the capacity available at the sites for LHCb, as a consequence of normal operational downtimes of the SCAS service. As deployment in production is the ultimate goal of a piloting activity, I doubt that it is worth starting the expensive pilot work before this functional limitation is fixed. Are we, as middleware providers, really confident that this service can go into production as it is now?

A technical discussion (Massimo, Oscar, Maarten) followed to better analyse and clarify the current situation with the exit codes.

Summary: at the moment glexec does not distinguish a negative authorisation decision or a mapping failure from a technical failure.

Oscar: glexec can be configured to forward the exit codes of the application, so if the application returns 0 and glexec uses a sufficiently high code for its internal errors, it should be possible for the framework to distinguish the two. glexec can currently produce about 15-16 different exit codes, which we currently don't return, and we envisage bundling them into a very limited and focused set of error codes.
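The distinction Oscar describes could look roughly like this on the framework side. This is only a sketch: it assumes glexec forwards the payload's exit code unchanged and reserves codes at or above some threshold for its own errors. The threshold value 200 and the function name are hypothetical, not documented glexec behaviour.

```python
# Hypothetical threshold: codes at or above this value are assumed to be
# glexec-internal errors; anything below is assumed forwarded from the payload.
GLEXEC_ERROR_BASE = 200  # assumption, not a documented glexec value


def classify_exit_code(code: int, glexec_error_base: int = GLEXEC_ERROR_BASE) -> str:
    """Return 'payload_ok', 'payload_failed' or 'glexec_error' for an exit code."""
    if code == 0:
        return "payload_ok"
    if code >= glexec_error_base:
        return "glexec_error"
    return "payload_failed"
```

With such a convention the framework would only need to know the threshold, not the full list of glexec codes.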

Massimo (about confidentiality): You mentioned confidentiality as a reason not to provide information about a user not being authorised. I think that there shouldn't be constraints in this respect, as the system already knows the user proxy on the other side.

Oscar: the details are logged in the SCAS logs. The reasons for non-authorisation have to be hidden, otherwise that would be a privacy exposure.

Maarten: logging in the system means that, in case of problems, a site admin has to be involved in the analysis every time a pilot cannot run, which is not efficient.

Massimo: it is very important that the submission system be put in a position to discriminate an authorisation problem from others. Otherwise a particular user not authorised to run on a particular site would be blindly re-scheduled (possibly with increasing priority).

Maarten: the more information we give to the submission system, the less the site needs to be involved. Of course the submitter doesn't need to know the reasons why a user was banned, but knowing that the user is banned (vs. e.g. an expired proxy) is important.
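To illustrate why this discrimination matters to a submission system, here is a minimal sketch of the scheduling decision it would enable. The failure categories and action names are hypothetical; the actual glexec error codes were still to be documented at the time of the meeting.

```python
from enum import Enum


class GlexecFailure(Enum):
    # Hypothetical categories, not the real glexec error codes.
    NOT_AUTHORISED = "not_authorised"  # user banned or not authorised at the site
    EXPIRED_PROXY = "expired_proxy"    # credential problem on the user side
    TECHNICAL = "technical"            # e.g. SCAS back-end unavailable


def next_action(failure: GlexecFailure) -> str:
    """Decide what the submission system does with the user's payload."""
    if failure is GlexecFailure.NOT_AUTHORISED:
        # Re-scheduling the same user on the same site would fail again.
        return "do_not_reschedule_user_on_site"
    if failure is GlexecFailure.EXPIRED_PROXY:
        return "request_new_proxy"
    # Technical failures say nothing about the user, so a retry is reasonable.
    return "retry_later"
```

Without the first category, every failure falls into the retry branch, which is the blind re-scheduling Massimo warns about.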

Oscar will send the documentation of the existing error codes to the egee-pps-pilot-scas@cernNOSPAMPLEASE.ch mailing list.

Atlas

Massimo: As Atlas we welcome the idea of FZK and IN2P3 starting the pilot activity. If we could see it running without problems in real production conditions for one month it would be very good.

Antonio: That is what we want to do as well. However, on behalf of SA1 I have to point out that running a pilot in production does not come for free: it commits significant resources at the sites. As we are starting with a service that we already know has limitations, and that therefore couldn't be released to production as is, I would try to optimise the costs of running the pilot in this phase.

Massimo: I think that if FZK could start rather soon already some of the first issues could be found and then IN2P3 could step in once the issue of the return codes is fixed. Of course a deployment in production would be authorised only when the pilot has determined that a good integration is reached. Atlas and LHCb have invested a lot, so I think that we should send a strong message that things are moving and working in production environment with real users and use cases can give a big push.

Roberto: only the production environment can show the evidence of real problems. LHCb's wish would be to deploy glexec everywhere as soon as possible. Working in PPS does not come for free for LHCb nor for Atlas (dedicated agents, dedicated RB and nobody will run there)

Antonio: Developments are needed on the side of the submission frameworks PanDa and Dirac3 as well, so some effort in the experiments will in any case have to be allocated for this work. You are saying that you would nevertheless prefer to use the production environment rather than the PPS one as the development workspace. I thought that doing otherwise would have been more economical overall. However, as SA1, I will put as a condition that this issue is fixed before the thing can go to production.

Maarten: or that the pilot has shown that in reality this is not a big issue.

Antonio: we have to be careful with that because we can expect a higher number of cases of SCAS misconfigured or off-line in real production than the ones we likely will have in the controlled environment of the pilot.

Roberto: the decision to deploy it fixed or not must depend also on the timeline for the fix

Oscar: there is a critical bug open and I think that two weeks should be a reasonable time to provide a fix to the error codes.

Antonio: the fix should be given a high priority by the EMT

Maarten: a recommendation: please follow the good 80-20 rule. Better to have a good although not perfect fix in two weeks rather than a perfect one in a month.

Layout

Massimo: a requirement for Atlas is to use the lcg-CE to access the WNs

Antonio: probably we'll have to set up a CE that is not part of the production, e.g. a subcluster accessing a subset of nodes with glexec enabled.

Maarten: that's not certain. It really depends on the site administrators; maybe they just want to deploy over all the WNs.

Antonio: wouldn't a deployment on the whole production introduce the risk of the whole site "disappearing" for the VO if something goes wrong?

Roberto: LHCb has a way to switch off the use of glexec in case of problems and re-start as usual submitting with the role "pilot"

NIKHEF

Oscar: at NIKHEF we plan to deploy glexec early next week and SCAS over the whole production soon after the small test on the testing facility.

Massimo: this appears not to follow the roll-out strategy which we are agreeing on in this pilot, that is to install at FZK first and then at IN2P3. Of course a third site is welcome to join, but we should move in a coordinated way.

Oscar: Nikhef is not a PPS site and we want to demonstrate as soon as possible the validity of glexec in production.

Antonio: none of the sites in this pilot are pre-production sites. Nikhef is welcome to join the preproduction activity with its production sites, but that should be done in a coordinated way

FZK

Angela: before upgrading the whole production at FZK we will need to test in our pre-production, especially if a roll-back is not possible.

Antonio: the roll-back should be possible as it is just a matter of de-installing an rpm

Maarten: and consider that the functionality will only be used by users who knowingly run glexec, and not by accident.

Angela: that may be the case, but we still have to follow our procedure for the deployment.

Antonio: this safer approach was used by FZK with Alice as well during the CREAM pilot. They started by accessing the production queues from pre-production and then moved towards the production.

Maarten: I suppose that the pre-production test could be "limited", testing only the basic authorisation workflow. How long should this pre-production phase last? Would LHCb be available for a quick test on this pre-production infrastructure?

Andrea: can't we use the usual trick of publishing a value of GlueCEStateStatus different from 'Production'?

Roberto: that would be preferable for LHCb.

Antonio: that should prevent the production from being affected. Additional nodes could afterwards be attached to the CE to ramp up.
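The trick Andrea mentions can be sketched as follows, assuming ordinary production submission only considers CEs whose published GlueCEStateStatus is 'Production'. The CE identifiers and the 'Testing' value below are hypothetical placeholders, and the CE records are plain dictionaries rather than real information-system entries.

```python
def production_visible(ces):
    """Keep only the CEs that ordinary production matchmaking would select."""
    return [ce for ce in ces if ce.get("GlueCEStateStatus") == "Production"]


# Hypothetical CE records: the pilot CE publishes a non-'Production' status,
# so it is skipped unless a job targets it explicitly.
ces = [
    {"id": "ce-prod.example.org", "GlueCEStateStatus": "Production"},
    {"id": "ce-glexec-pilot.example.org", "GlueCEStateStatus": "Testing"},
]
visible_ids = [ce["id"] for ce in production_visible(ces)]
```

In this sketch only the first CE remains visible to normal jobs, while the experiments taking part in the pilot can still address the second one directly.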

Maarten: do you have enough resources to set up such an additional CE?

Angela: after a pre-production test on an isolated set of WNs, if everything works we could redirect the new CE's queue to the same physical WNs as the production system.

IN2P3 (via chat)

A requirement on the distribution

  • [00:51:55] Pierre Girard We are using a relocatable installation of WN... So your RPM must be relocatable.

A safe way to deploy glexec without affecting the production

  • [01:07:48] Pierre Girard We are able to deal with 2 different versions of the MW at same time. Our CE configuration permits to choose which WN set-up to use for the jobs.

Use cases (by Users)

LHCb and Atlas are ready to integrate the installation at FZK and the next one in IN2P3 within their production submission frameworks

Andrea: CMS, and in particular the developers of the glideins, are in general interested, although I don't think that they are far enough ahead to start using the pilot intensively. They will be informed and they will presumably use the set-up for testing.

(the lcg-CE at FZK will initially have to support the three VOs CMS, Atlas and LHCb)

Gianni: SA3

Metrics (by Users)

Antonio summarised the goals of the pilots, as elaborated by Gianni

The general goals of the pilot are:

To verify, from a functional point of view, the correct execution of users' jobs through pilot jobs with SCAS authz
The main point here is to verify the integration of glexec-SCAS with the experiments' frameworks, which is not a priority in certification.

To verify the reliability and the maintainability of the SCAS service
The main point here is to verify that the SCAS service can be easily configured and maintained at a site installation, with a production-like setup that can be different from the one used for certification.

General Agreement on Service Level and Conditions

not covered

Timeline

A tentative timeline is discussed:

FZK needs 1 week (unless there are major problems) to deploy the pre-production system, starting from the 9th

On the 16th LHCb could be allowed to start

Gianni: the certification is still in progress with stress testing. It should be completed hopefully within ten days

Antonio: if there are no delays with the delivery of the glexec error codes it would make sense for IN2P3 to start only with this new version, instead of having to do an upgrade very close in time

After all issues are under control and everybody agrees that the service is behaving correctly, the pilot will be left in operation for two extra weeks and then closed.

(a gantt diagram with the initial agreed timeline is available at PpsPilotSCAS#Overall_Planning)

AOB

Pierre (via chat): [17:14:41] Pierre Girard Is it possible to make load-balancing with the SCAS Service ? This service is critical and its high availability will be my major concerns.

Antonio (off-line): the issue was already pointed out during the first tests. Load balancing is currently not possible, but there is work in progress to enable this capability.

Actions

Actions for SA1

The actions related to this pilot and owned by SA1 are tracked with the PPS Task Tracker
http://www.cern.ch/pps/index.php?dir=./ActivityManagement/TaskTracker/
Specifically via the subtasks of TASK:8986 (Set-up and run SCAS Pilot )

the tasks initially defined are:
TASK:8986 : Set-up and run SCAS Pilot , 62 days left

  • TASK:8987 : Set up a YUM repository at CNAF (finished on 6th-Feb) - assigned to CNAF
  • TASK:8988 : Install SCAS service, special CE and glexec @ FZK, 6 days left - assigned to FZK

Actions for JRA1, SA3, experiments

  • JRA1 (Oscar): to provide a fix to better detail error codes of glexec. Due: 20-Feb-09. Note: it would be useful to the interfacing systems if the specification of the interface could be provided at least a week earlier
  • SA3 (Gianni): to finish the certification of PATCH:2767, PATCH:2635, PATCH:2770. Due: 20-Feb-09

recording of the phone conference (EVO)

Topic attachments
Attachment: pilotSCAS-09-02-05-1006.rar (r1, 37962.3 K, 2009-02-09 15:03, AntonioRetico): recording of the phone conference
Topic revision: r4 - 2009-02-09 - AntonioRetico
 