Argus Development : Christoph Witzig, Chad La Joie
CMS: Andrea Sciaba', Sanjay Padhi
ATLAS: Massimo Lamanna, Maxim Potekhin, Jose Caballero
Alice: Patricia Mendez Lorenzo
LHCb: Apologies
Service Introduction (by SA1)
This pilot aims to replicate very closely the kind of testing done for the glexec/SCAS chain. Normally the end users (the LHC experiments) are involved in a pilot from the kick-off, but as Atlas was not available last week, the meeting with the experiments was scheduled for today. Issues and planning strictly related to installations were discussed last week. The aim of today's meeting is to review the plan in the light of the agendas of the four experiments and of their actual availability to use the service provided.
NOTE: as there is a risk of misunderstanding in using the word 'pilot' in this context, I will use the term 'experimental service' (although it does not conform to the current EGEE naming convention) to indicate this coordinated testing activity done by VO users on production sites.
Christoph was invited to give a short introduction to the Argus service. He confirmed that from the user perspective (use of glexec) there is no functional change, the main advantages of the new system being the more powerful policy language and other features of interest to site administrators.
Maxim (Atlas) asked about the current WLCG recommendation to sites (whether to use SCAS or Argus).
Antonio and Massimo: as of today SCAS is still the only solution formally approved by WLCG for deployment in production. However, as the new Argus service went through a smooth and successful certification, sites were given the option to volunteer and test the new service in production as well. So far a number of sites interested in the new features have answered this call, and that is the reason for this prototyping activity.
Use cases (by Users)
I report below for reference the use cases of the experimental service as proposed last week and forwarded to the experiments.
Experiment framework using glexec for production pilot jobs.
Gathering of requirements and analysis for monitoring tools
A lively discussion followed (~40 min) concerning the use cases for the different VOs, often drifting toward technical points, with the participation of all the experiment representatives and the developers. I summarise below the main points by VO.
Atlas
Atlas participated in the previous activity focused on glexec and SCAS, investing considerable effort in debugging the installation of glexec at various sites. This proved very valuable for understanding what using glexec really means (e.g. in terms of the environment of the pilot job, working directory etc.).
The integration work can be considered closed from the functional point of view. They strongly recommend that the VOs now starting the integration process check the Atlas reports from the SCAS experimentation, which can save a lot of effort.
The main point of interest for Atlas in the current round of glexec testing is scalability, e.g. verifying that a sudden restart of activity at a site, and the consequent wave of jobs hitting the authorization service all at the same time, does not create conditions that can lead to the crash or death of the service itself and the collapse of the site.
In this respect they expressed the desire to analyse the results of the testing done in certification in order to study the operational specifications of the Argus back-end and consequently to make a proposal for an operational use case based on their workflow.
Antonio noted that the non-functional requirements on Argus (e.g. maximum rate of requests) were surely the object of previous discussion and agreement with the experiments as well, so it is likely that most of these reasonable concerns are already duly addressed by the software. However, Atlas is welcome to check the results from certification.
NOTE: the results of certification are now linked from the Argus Pilot twiki page.
As an additional use case Atlas can consider the possibility of using glexec to instrument 'multi-user' pilots (meaning that several identity switches are done in sequence within the scope of the same pilot).
The requirement of Atlas on the testing infrastructure is that the sites offer a reasonable amount of resources (e.g. > 100 cores) behind their glexec-enabled queues, so that real production workflows can be implemented.
Asked by Antonio whether CNAF could deploy such an amount of resources, Andrea replied that in principle the T1 is interested in the activity, provided that the software to be deployed is proven to be stable and harmless for production. He cannot commit immediately, though.
Input for the planning
Atlas ruled out getting involved at the early stages to test the set-up of the sites. They suggested that this part of the schedule be defined with the experiments that now have to undertake the integration work. Atlas is available to get involved in a second phase, when more resources are commissioned at the sites. Antonio is requested to propose a timeline for this to Atlas (action).
The Atlas people left the meeting early due to another meeting.
CMS
CMS needs to start the integration of their glidein submission framework with glexec.
CMS shares with Atlas a number of technical concerns about the scalability of the site authorization service. In particular Sanjay pointed out that, in his previous experience, when the number of authorization requests (to LCMAPS?) became excessive he ended up with incorrect mappings.
They would like to see the results of testing as well.
Christoph specified that the behaviour of glexec was simulated in testing and the system was able to sustain a load of ~60 requests/sec.
NOTE: A summary of the load and aging tests done before the certification was sent by Christoph after the meeting.
Load tests:
Service host: 1x 2.33 GHz CPU, 1 GB RAM
client recreated - simulates what glexec would do
~60 req/sec, ~160ms (limited by spawning processes)
client reused, repeat request - simulates pilot jobs
~1000 req/sec, ~37.6ms
Aging tests:
Test operation over several days with several million requests
Memory usage: stable
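As a rough aid to reading the load-test figures above against the scalability concerns raised by Atlas and CMS, the arithmetic can be sketched as follows. This is only a back-of-the-envelope sketch: it assumes the millisecond figures are mean per-request response times, and the burst size of 100 simultaneous requests is a hypothetical value chosen to match the "> 100 cores" figure mentioned by Atlas, not a measured workload. The function names are illustrative.

```python
# Back-of-the-envelope reading of the load-test figures quoted in the minutes.
# Assumption: the millisecond figures are mean per-request response times.

# (sustained requests/sec, response time in ms) as reported by Christoph
REPORTED = {
    "client recreated (glexec-like)": (60.0, 160.0),
    "client reused (pilot-like)": (1000.0, 37.6),
}


def implied_concurrency(rate_per_sec, latency_ms):
    """Little's law: requests in flight = throughput * mean response time."""
    return rate_per_sec * latency_ms / 1000.0


def drain_time_sec(burst_size, rate_per_sec):
    """Time needed to serve a burst of simultaneous requests at a sustained rate."""
    return burst_size / rate_per_sec


if __name__ == "__main__":
    BURST = 100  # hypothetical: one request per core on a 100-core queue
    for label, (rate, latency) in REPORTED.items():
        in_flight = implied_concurrency(rate, latency)
        drain = drain_time_sec(BURST, rate)
        print(f"{label}: ~{in_flight:.1f} requests in flight, "
              f"burst of {BURST} served in ~{drain:.2f} s")
```

Under these assumptions the sketch only indicates orders of magnitude; it says nothing about how the service behaves once the request rate exceeds the sustained figure, which is the situation the experiments are concerned about.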
Input for the planning
Due to data-taking activity in progress, CMS cannot schedule the beginning of the integration work before January/February (set to mid-February for planning purposes).
The CMS people left the meeting early due to a concurrent one.
Alice
Alice needs to start the integration work. They do not use test systems; the integration will be done directly on the production framework, so it can be done only when there is no concurrent activity, e.g. during the LHC shutdown. They are available to work with sites supporting Alice (special requirements on VOBOX) regardless of whether they are in the production grid or not.
Following a suggestion from Jose, they expect to benefit from studying the glexec user experiences from Atlas (these will be linked from the pilot twiki page).
Input for the planning
Patricia estimates that Alice will be able to start the integration work roughly around the second/third week of January.
LHCb
LHCb was absent but sent a statement for the minutes:
From: Roberto Santinelli
Sent: Tuesday, December 01, 2009 4:00 PM
To: egee-pilot-argus (Argus pilot service)
Cc: Philippe Charpentier
Subject: Argus and LHCb
Ciao Antonio,
I feel responsible to pass the following sentences (Philippe to rubber stamp them):
1. ARGUS or SCAS as authz services (differently than gLExec) are not effectively providing interfaces to the end users and then LHCb (but all VOs in general) might not feel concerned by testing them. This is purely infrastructure (differently than gLExec whose need to be integrated in the framework of the experiment driven the previous SCAS/gLExec pilot)
2. It is very likely we shall have collisions pretty soon, even maybe at record energy. Therefore we don't have much effort to distract from getting these data properly analyzed and all experts in DIRAC are fully busy with that.
3. LHCb is still *not* fully convinced that gLexec+SCAS can be deployed everywhere smoothly; why should LHCb have already to look at its evolution? Is there a GLexec version based on Argus?
All of that is and remains true despite the fact we also think that this PPS pilot might offer the possibility to cross check that switching backend from SCAS/LCAS to ARGUS the interface to experiment frameworks offered by gLExec *does not change*. LHCb eagerly aim this is still the case.
Timeline
Antonio summarised the plan as it was discussed last week by the sites (reported here for convenience).
| Task | Owner | Start Date | Due Date | Status |
| Set-up repositories and documentation | SA1, SA3, CNAF | 23-Nov-09 | 24-Nov-09 | In Progress |
| Preliminary installation (ARGUS, WN, CE) | SWITCH | 25-Nov-09 | 27-Nov-09 | To start |
| Core installations (ARGUS, WN, CE) | FZK, SRCE, CESNET, CNAF | 30-Nov-09 | 04-Dec-09 | To start |
Constraints and milestones
kick-off with sites: 25-Nov (11 AM CET)
1st site technically available for Experiments to test (SWITCH): 1-Dec
kick-off with experiments: 1-Dec (11 AM CET)
All sites technically available for Experiments to test (SWITCH): 7-Dec
END of activity (proposed): 29-Jan
New events since then:
The first deadline was met: a first installation was actually performed at SWITCH. The endpoints are listed at https://twiki.cern.ch/twiki/bin/view/EGEE/PilotServiceArgus#SWITCH . The batch system behind the CE (few WNs) cannot handle real production activity but is suitable for test jobs. The site currently fully supports Atlas according to the requirements defined during the glexec/SCAS pilot. An installation log was made available in the task report as instructions for the other sites. The main comments on this report are:
Alessandro confirmed that the installation of the Argus server with YAIM proved to be smooth, viable and well documented. Other sites can surely start it if they want to gain experience with the new service.
Antonio observed that the procedure used to install the Argus-compatible glexec WNs is, in his opinion, far too complicated to be applied at production sites (this is because the installation was performed using the content of patch 3093, which is not yet certified and properly distributed). As the experiments have ruled out starting their activity before the end of December, he suggests suspending the installation at the other sites until the patches are in better shape.
Following up on the point raised by CESNET about the installation on SL4 WNs, the integration team asked the glexec developers on the 26th of November to provide a patch for this.
As pointed out by Christoph, this still needs to be formalised. Gianni will open a bug (action).
In consideration of these findings and the input from the experiments' planning, the following changes are applied to the plan:
The start of the installation work at the other sites is suspended pending the following:
For SL5 sites: Argus-enabled glexec plugin to be certified on SL5 (PATCH 3093)
For SL4 sites: Argus-enabled glexec plugin to be certified on SL4 (PATCH still to be created)
A corresponding repository to be set-up (at CNAF)
A check-point meeting will be called by mid-December (date to be decided) to verify progress on these items. Participation of the experiments is welcome but not mandatory.
One milestone moved:
All sites technically available for Experiments to test: 15-Dec → 15-Jan
New milestones defined as a reference for SA1 planning
Indicative start of Alice developments to integrate glexec: 18-Jan
Indicative start of CMS developments to integrate glexec: 15-Feb
NOTE: The END of activity will need to be moved forward (point not discussed at the meeting)
Provide a report describing the issues faced by CNAF in installing glexec on the WNs. INFN-T1 is experiencing a stability problem with GPFS interacting with the WN-on-demand system adopted locally in the resource centre. Since they decided to provide virtual WNs for the pilot, the issue consequently affects the deployment of the glexec WN component at the site.
Provide instructions on how to preserve local policies during the upgrade of the Argus server to a newer version, both in an e-mail to the sites and in PATCH:3536. This was done on the 2nd of February; see https://savannah.cern.ch/patch/?3536