Pilot Kick-Off Meeting Minutes Tue 1 Dec 2009

  • Date: Tue 1 Dec 2009
  • Agenda: 75302
  • Description: Pilot Service: glexec/Argus (experiments)
  • Chair: Antonio Retico

Attendance

  • SA1: Antonio Retico
  • SRCE: Nikola Garafolic
  • Switch: Alessandro Usai
  • Cesnet: Tomas Kouba
  • FZK: Angela Poschlad
  • INFN-CNAF: Giuseppe Misurelli, Andrea Ceccanti
  • SA3: Gianni Pucciani
  • SAM/Nagios: -
  • Argus Development: Christoph Witzig, Chad La Joie
  • CMS: Andrea Sciaba', Sanjay Padhi
  • ATLAS: Massimo Lamanna, Maxim Potekhin, Jose Caballero
  • Alice: Patricia Mendez Lorenzo
  • LHCb: Apologies

Service Introduction (by SA1)

This pilot aims to replicate very closely the kind of testing done for the glexec/SCAS chain. Normally the end users (the LHC experiments) are involved in a pilot from the kick-off, but as Atlas was not available last week the meeting with the experiments was scheduled for today. Issues and planning strictly related to installations were discussed last week. The aim of today's meeting is to review the plan in the light of the agendas of the four experiments and of their actual availability to use the service provided.

NOTE: as there is a risk of misunderstanding when using the word 'pilot' in this context, I will use the expression 'experimental service' (although it does not conform to the current EGEE naming convention) to indicate this coordinated testing activity done by VO users on production sites.

Christoph was invited to give a short introduction to the Argus service. He confirmed that from the user perspective (use of glexec) there is no functional change; the main advantages of the new system are the more powerful policy language and other features of interest to site administrators.

Maxim (Atlas) asked about the current WLCG recommendation to sites (whether to use SCAS or Argus). Antonio and Massimo replied that, as of today, SCAS is still the only solution formally approved by WLCG for deployment in production. However, as the new Argus service went through a smooth and successful certification, sites were given the option to volunteer and test the new service in production as well. So far a number of sites interested in the new features have responded to this call, and that is the reason for this prototyping activity.

Use cases (by Users)

I report below for reference the use cases of the experimental service as proposed last week and forwarded to the experiments.


  • Experiment framework using glexec for production pilot jobs.
  • Test of grid-wide banning feature by OSCT
  • Gathering of requirements and analysis for monitoring tools

A lively discussion followed (~40 min) concerning the use cases for the different VOs, often drifting toward technical points, with the participation of all the experiment representatives and the developers. I summarise below the main points by VO.

Atlas

Atlas participated in the previous activity focused on glexec and SCAS, investing considerable effort in debugging the installation of glexec at various sites. This proved to be very valuable in order to understand what using glexec really means (e.g. in terms of the environment of the pilot job, working directory etc.).

The integration work can be considered closed from the functional point of view. They strongly recommend that the VOs that now have to start the integration process check the Atlas reports from the SCAS experimentation, which can save a lot of effort.

The main point of interest for Atlas in the current round of glexec testing is scalability, e.g. to verify that a sudden restart of activity at a site, and the consequent wave of jobs all contacting the authorization service at the same time, does not create conditions that can lead to the crash of the service itself and the collapse of the site.
In this respect they expressed the desire to analyse the results of the testing done in certification in order to study the operational specifications of the Argus back-end and consequently to make a proposal for an operational use case based on their workflow.
Antonio noted that the non-functional requirements on Argus (e.g. maximum rate of requests) were surely the object of previous discussion and agreement with the experiments as well, so it is likely that most of these reasonable concerns are already duly addressed by the software. However, Atlas is welcome to check the results from certification.


NOTE: the results of certification are now linked from the Argus Pilot twiki page.

As an additional use case, Atlas can consider the possibility of using glexec to instrument 'multi-user' pilots (meaning that several identity switches are done in sequence within the scope of the same pilot).

The requirement of Atlas on the testing infrastructure is that the sites offer a reasonable amount of resources (e.g. > 100 cores) behind their glexec-enabled queues, so that real production workflows can be implemented.
Asked by Antonio about the availability of CNAF to deploy such an amount of resources, Andrea replied that in principle the T1 is interested in the activity provided that the software to be deployed is proven to be stable and harmless for production. He cannot commit immediately though.

Input for the planning
Atlas excluded the possibility of getting involved at the early stages to test the set-up of the sites. They suggested that that part of the schedule be defined with the experiments that now have to undertake the integration work. Atlas is available to get involved in a second phase, when more resources are commissioned at the sites. Antonio is requested to propose a timeline for this to Atlas (action).

The Atlas people left the meeting early due to another meeting.

CMS

CMS needs to start the integration of their glidein submission framework with glexec.

CMS shares with Atlas a number of technical concerns about the scalability of the site authorization service. In particular Sanjay pointed out that in his previous experience, when the number of authorization requests (to LCMAPS?) became excessive, he ended up with incorrect mappings.

They would like to see the results of testing as well.

Christoph specified that the behaviour of glexec was simulated in testing and the system was able to sustain a load of ~60 requests/sec.


NOTE: A summary of the load and aging tests done before the certification was sent by Christoph after the meeting:

  • Load tests:
    • Service Host: 1x 2.33GHz CPU, 1 GB RAM
    • client recreated - simulates what glexec would do
      • ~60 req/sec, ~160ms (limited by spawning processes)
    • client reused - simulates what CREAM/WMS would do
      • ~240 req/sec, ~120ms
    • client reused, repeat request - simulates pilot jobs
      • ~1000 req/sec, ~37.6ms
  • Aging tests:
    • Test operation over several days with several million requests
    • Memory usage: stable
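
As a rough aid for reading the load test figures above, the sketch below applies Little's law (mean requests in flight = request rate x mean time in the system) to the three quoted scenarios, and estimates how long a burst of simultaneous requests would take to clear at each sustained rate. It assumes that the quoted millisecond values are mean response times, which the summary does not state, and the burst size of 100 is purely illustrative (one request per core at a 100-core site). The function names are made up for this example.

```python
# Figures quoted in the load test summary: (scenario, requests/sec, latency in ms).
LOAD_TESTS = [
    ("client recreated (glexec-like)", 60, 160.0),
    ("client reused (CREAM/WMS-like)", 240, 120.0),
    ("client reused, repeat request (pilot-like)", 1000, 37.6),
]

# Assumption: an illustrative burst of one request per core at a 100-core site.
BURST_SIZE = 100


def in_flight(rate_per_sec, latency_ms):
    """Little's law: mean number of requests in flight at the given rate."""
    return rate_per_sec * latency_ms / 1000.0


def drain_time_sec(burst_size, rate_per_sec):
    """Time needed to serve a burst if the service sustains the given rate."""
    return burst_size / rate_per_sec


for name, rate, latency in LOAD_TESTS:
    print(
        f"{name}: ~{in_flight(rate, latency):.1f} requests in flight, "
        f"{drain_time_sec(BURST_SIZE, rate):.2f} s to serve {BURST_SIZE} requests"
    )
```

Under these assumptions the glexec-like scenario corresponds to roughly ten concurrent requests, and a burst of 100 requests would be absorbed in under two seconds. This is only an estimate and does not replace a test with a real restart of activity at a site.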

Input to planning
Due to the data-taking activity in progress, CMS cannot schedule the beginning of the integration work before January/February (set to mid-February for planning purposes).

The CMS people left the meeting early due to a concurrent one.

Alice

Alice needs to start the integration work. They do not use test systems: the integration will be done directly on the production framework, so it can be done only if there is no concurrent activity, e.g. during the LHC shutdown. They are available to work with sites supporting Alice (special requirements on VOBOX) regardless of whether they are in the production grid or not.

Following a suggestion from Jose, they expect to gain some advantage by studying the glexec user experiences from Atlas (they will be linked from the pilot twiki page).

Input for planning
Patricia estimates that Alice will be able to start the integration work around the second or third week of January.

LHCb

LHCb was absent but sent a statement for the minutes:

From: Roberto Santinelli 
Sent: Tuesday, December 01, 2009 4:00 PM
To: egee-pilot-argus (Argus pilot service)
Cc: Philippe Charpentier
Subject: Argus and LHCb

Ciao Antonio, 
  I feel responsible to pass the following sentences (Philippe to rubber stamp them): 

1.   ARGUS or SCAS as authz services (differently than gLExec) are not effectively providing interfaces to the end users and then LHCb (but all VOs in general) might not feel concerned by testing them. This is purely infrastructure (differently than gLExec whose need to be integrated in the framework of the experiment driven the previous SCAS/gLExec pilot)
2.   It is very likely we shall have collisions pretty soon, even maybe at record energy. Therefore we don't have much effort to distract from getting these data properly analyzed and all experts in DIRAC are fully busy with that.
3.   LHCb is still *not* fully convinced that gLexec+SCAS  can be deployed everywhere smoothly; why should LHCb have already to look at its evolution? Is there a GLexec version based on Argus?

All of that is and remains true despite the fact we also think that this PPS pilot might offer the possibility to cross check that switching backend from SCAS/LCAS to ARGUS the interface to experiment frameworks offered by gLExec *does not change*. LHCb eagerly aim this is still the case. 

Timeline

Antonio summarised the plan as it was discussed last week by the sites (reported here for convenience).


| Task | Owner | Start Date | Due Date | Status |
| Set-up repositories and documentation | SA1, SA3, CNAF | 23-Nov-09 | 24-Nov-09 | In Progress |
| Preliminary installation (ARGUS, WN, CE) | SWITCH | 25-Nov-09 | 27-Nov-09 | To start |
| Core installations (ARGUS, WN, CE) | FZK, SRCE, CESNET, CNAF | 30-Nov-09 | 04-Dec-09 | To start |

Constraints and milestones

  • kick-off with sites: 25-Nov (11 AM CET)
  • 1st site technically available for Experiments to test (SWITCH): 1-Dec
  • kick-off with experiments: 1-Dec (11 AM CET)
  • All sites technically available for Experiments to test (SWITCH): 7-Dec
  • END of activity (proposed): 29-Jan

New events since then:

  • The first deadline is met: a first installation was actually performed at SWITCH. The endpoints are listed on the pilot twiki page (https://twiki.cern.ch/twiki/bin/view/EGEE/PilotServiceArgus#SWITCH). The batch system behind the CE (few WNs) cannot handle real production activity but is suitable for test jobs. The site currently fully supports Atlas according to the requirements defined during the glexec/SCAS pilot. An installation log was made available in the task report as instructions for the other sites. The main comments on this report are:
    • Alessandro confirmed that the installation of the Argus server with YAIM proved to be smooth, viable and well documented. Other sites can surely start it if they want to gain experience with the new service.
    • Antonio observed that the procedure used to install the Argus-compatible glexec WNs is in his opinion far too complicated to be applied at production sites (this is because the installation was performed using the content of patch 3093, which is not yet certified and properly distributed). As the experiments have excluded starting their activity before the end of December, he suggests suspending the installation at the other sites until the patches are in better shape.
  • Following up on the point raised by CESNET about the installation on SL4 WNs, the integration team asked the glexec developers on 26 November to provide a patch for this.
    • As pointed out by Christoph this still needs to be formalised. Gianni will open a bug (action).

In consideration of these findings and of the input from the experiments' planning, the following changes are applied to the plan:

  • The start of the installation work at the other sites is suspended, waiting for the following:
    • For SL5 sites: Argus-enabled glexec plugin to be certified on SL5 (PATCH 3093)
    • For SL4 sites: Argus-enabled glexec plugin to be certified on SL4 (PATCH still to be created)
    • A corresponding repository to be set-up (at CNAF)

  • A check-point meeting will be called by mid-December (date to be decided) to verify the progress on that. Participation of the experiments is welcome but not mandatory.

  • One milestone moved:
    • All sites technically available for Experiments to test: 15-Dec → 15-Jan
  • New milestones defined as a reference for SA1 planning
    • Indicative start of Alice developments to integrate glexec: 18-Jan
    • Indicative start of CMS developments to integrate glexec: 15-Feb

NOTE: The END of activity will need to be moved to a later date (point not discussed at the meeting).

Actions

Mentioned actions are reported on the pilot twiki


| Assigned to | Due date | Description | State | Closed | Notify |
| GiuseppeMIsurelli | 2010-02-19 | Provide a report describing the issues being faced by CNAF for the installation of glexec on the WNs. Notes: INFN-T1 is experiencing a problem with the stability of GPFS interacting with the WN-on-demand system adopted locally in the resource center. Since they decided to provide virtual WNs for the pilot, the issue consequently affects the deployment of the glexec WN component at the site. | | 2010-02-19 | MaartenLitmaath |
| GianniPucciani | 2010-02-19 | Provide functional specification of glexec tests being implemented at SRCE | | | |
| ChadLaJoie | 2010-02-03 | Provide instructions on how to preserve local policies during the upgrade of the Argus server to a newer version, both in an e-mail to the sites and in PATCH:3536. Notes: this was done on the 2nd of February. This is done now at https://savannah.cern.ch/patch/?3536 | | 2010-03-02 | MaartenLitmaath |


Topic revision: r4 - 2009-12-03 - unknown