INFN-CNAF: Danilo Dongiovanni, Giuseppe Misurelli, Andrea Ceccanti
SA3: Gianni Pucciani
SAM/Nagios: John Shade
Argus Development: Christoph Witzig, Chad La Joie
Service Introduction (by SA1)
This pilot aims to replicate very closely the kind of testing done for the glexec/SCAS chain. Normally the end users (the LHC experiments) are involved in a pilot from the kick-off, but as ATLAS was not available to participate today, the meeting with the experiments will be scheduled for next week. Issues and planning strictly related to installations are discussed today. The planning proposed today may have to change in order to be compatible with the agenda of the experiments involved.
Use cases (by Users)
Use cases for the pilot are discussed and approved.
The main use case is the same as in the previous pilot: Experiment framework using glexec for production pilot jobs. Approved without discussion.
Christoph proposed a second use case: Test of the grid-wide banning feature by OSCT. This use case is accepted in the scope of the pilot. OSCT will have to be involved in the next meetings.
Antonio mentioned the possibility to extend this concept to include black-lists of users maintained by the experiments.
Christoph replied that this extension, although technically feasible, was not included among the policies originally discussed with the sites and WLCG. Even if it were proved to work, this would not mean that such a method would be applied. It may be a sensitive issue for LCG and some individual sites. In his view this should first be discussed at the policy level. Note: OSCT banning lists were already discussed and approved at JSPG and GDB level.
The decision was made not to include this extension in the scope of the pilot.
Antonio pointed out that the pilot is a good opportunity to verify the implications for monitoring tools as well. As Argus is a new service in production, it will have to be adequately monitored. He therefore proposed a side use case for this pilot: the SAM/Nagios team could observe and use the pilot installation in order to gather requirements for the future probes and tests.
John Shade described his ideas for monitoring glexec: the idea is to provide a functional test of glexec, independent of the back-end (Argus or SCAS), to be run at the sites on behalf of the VOs (with their proxy). Asked by Giuseppe which VO would be used, he excluded the use of OPS, which would require the configuration of dedicated roles at the sites without adding real value to the test of the functionality for the experiments.
Giuseppe (CNAF) remarked that any test done by the SAM/Nagios team should not interfere with the production system. In particular he is concerned that test jobs with undesired effects may be run on the production system using a VO proxy.
Alessandro (SWITCH) said that he could host test jobs at his site if needed (a few worker nodes on virtual machines with no real production aspirations).
Antonio stressed that the emphasis is not on testing new probes but rather on gathering requirements. However, he reassured the sites that any request to run test jobs from SAM/Nagios on the pilot infrastructure will always be discussed beforehand with the participating sites. NOTE: After the meeting Christoph pointed out that they have already developed some very simple Nagios scripts, which can also be tested and adapted by the SA1 team.
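As an aid to the requirements discussion, the back-end independent functional test described by John could look roughly like the sketch below. It only runs a command and maps the outcome onto the standard Nagios plugin exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function name run_probe, the glexec path and the use of the GLEXEC_CLIENT_CERT variable to pass the VO proxy are assumptions made for illustration; they were not agreed at the meeting and would have to be checked against the actual installation.

```python
import os
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed location of glexec on a worker node; adjust per site.
GLEXEC_PATH = "/opt/glite/sbin/glexec"


def run_probe(command, proxy_path=None, timeout=30):
    """Run `command` and return (nagios_code, message).

    The probe does not care whether Argus or SCAS sits behind glexec:
    it only looks at whether the payload command could be executed.
    """
    env = dict(os.environ)
    if proxy_path is not None:
        # Assumption: glexec takes the payload user's proxy from this variable.
        env["GLEXEC_CLIENT_CERT"] = proxy_path
    try:
        result = subprocess.run(
            command, env=env, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return CRITICAL, "CRITICAL: no answer within %d s" % timeout
    except OSError as exc:
        # Binary missing or not executable: the test itself could not run.
        return UNKNOWN, "UNKNOWN: cannot execute %s (%s)" % (command[0], exc)
    if result.returncode == 0:
        return OK, "OK: payload ran as '%s'" % result.stdout.strip()
    return CRITICAL, "CRITICAL: exit code %d" % result.returncode
```

A wrapper script would call, for example, run_probe([GLEXEC_PATH, "/usr/bin/id", "-un"], proxy_path=...) with a VO proxy, print the message and exit with the returned code. Whether and where such a test may run is subject to the agreement with the sites recorded above.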
The final list of use cases to consider is:
Experiment framework using glexec for production pilot jobs.
Test of the grid-wide banning feature by OSCT.
Gathering of requirements and analysis for monitoring tools.
Metrics (by Users)
The corresponding success criteria for the pilot are:
The glexec - Argus chain is demonstrated to interact correctly with the LHC experiments' frameworks for pilot jobs
Maintenance and operations of the Argus service declared supportable by the sites
OSCT able to ban a user on the whole pilot infrastructure without specific intervention of the site administrators
Collection of exhaustive requirements for the implementation of monitoring tools
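The banning criterion can be illustrated with a toy model: a centrally maintained ban list is pulled automatically by every site and always overrides the local decision, so no site administrator has to act. The sketch below is only a conceptual illustration. The classes CentralBanList and Site, the site names and the DN are hypothetical, and the code does not claim to reflect how Argus distributes or evaluates policies.

```python
class CentralBanList:
    """Hypothetical stand-in for a ban list maintained centrally (e.g. by OSCT)."""

    def __init__(self):
        self._banned = set()

    def ban(self, subject_dn):
        self._banned.add(subject_dn)

    def snapshot(self):
        # Sites receive an immutable copy of the current list.
        return frozenset(self._banned)


class Site:
    """Hypothetical site holding its own local authorization policy."""

    def __init__(self, name, local_allowed):
        self.name = name
        self.local_allowed = set(local_allowed)
        self.central_bans = frozenset()

    def refresh(self, central):
        # Stands for a periodic, automatic pull: no administrator involved.
        self.central_bans = central.snapshot()

    def authorize(self, subject_dn):
        # The central ban always wins over the local policy.
        if subject_dn in self.central_bans:
            return False
        return subject_dn in self.local_allowed


# Illustrative run with made-up names.
user = "/DC=example/CN=Some User"
central = CentralBanList()
sites = [Site(name, [user]) for name in ("site-a", "site-b", "site-c")]
before = [site.authorize(user) for site in sites]

central.ban(user)
for site in sites:
    site.refresh(central)
after = [site.authorize(user) for site in sites]
```

In this model the success criterion corresponds to every entry of after being False once the sites have refreshed, without any change to local_allowed at any site.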
Timeline
The timeline originally proposed by Antonio was discussed.
| Task | Owner | Start Date | Due Date | Status |
| Set-up repositories and documentation | SA1, SA3, CNAF | 23-Nov-09 | 24-Nov-09 | In Progress |
This is still in progress. Danilo (CNAF) raised questions about the content of the repository to be set up at CNAF, in particular about rpms that are known to be needed for the initial installation but are not yet certified.
Andrea (CNAF) reported that the installation of Argus has started at CNAF using the rpms from the patch repository in certification.
Antonio reminded the participants that it is accepted practice to start a pilot only when all the needed packages are certified (with a short digression on the difference between pilot and experimental services). That said, if the sites agree (as CNAF seems to do), he is not against starting with rpms that are not yet certified, provided that the rpms are duly documented in a patch.
| Task | Owner | Start Date | Due Date | Status |
| Preliminary installation (ARGUS, WN, CE) | SWITCH | 25-Nov-09 | 27-Nov-09 | In Progress |
Alessandro confirmed this as feasible. In fact the installation is almost finished (the correct version of glexec was installed manually on the WN). He asked for someone to test the functionality of the installation.
Other sites will wait for a report from SWITCH that the installation was successful before starting to install. In particular, the YAIM configuration of GLEXEC_WN must be finalized (SWITCH is working with SA3 at NIKHEF on this).
Action: Gianni will send out details about the tests he ran in certification (please describe them among the instructions for sites in the Pilot twiki page).
Milestone: 1st site technically available for Experiments to test (SWITCH): 1-Dec
| Task | Owner | Start Date | Due Date | Status |
| Core installations (ARGUS, WN, CE) | FZK, SRCE, CESNET, CNAF | 30-Nov-09 | 04-Dec-09 | To start |
Andrea (CNAF): confirmed
Emir (SRCE): confirmed
Angela (FZK): shall reply off-line (Action)
Tomas (CESNET): expressed a reservation about the availability of the add-ons for WN (LCAS/LCMAPS plugins) on the SL4 baseline as well. Upgrading the farm to SL5 is not feasible in this timeline.
This is an open point. Action: Antonio & Gianni to enumerate available deployment scenarios and see whether new developments have to be requested (or re-negotiations are needed with the sites)
Milestone: Remaining sites technically available for Experiments to test (FZK, SRCE, CESNET, CNAF): 7-Dec
The end of January is approved as a reasonable end date for the pilot (Milestone).
All milestones to be confirmed after discussion with the experiments.
General Agreement on Service Level and Conditions
Antonio proposed to re-use the best practices from previous pilots to manage internal releases. He gave a quick summary (more detailed information is written in these minutes).
Within the pilot, sites and developers agree on an "accelerated release process" in order to save time.
Basically, as soon as a modification to the software is deemed necessary, a patch is created and left in status "With Provider".
The modifications delivered to the pilot are documented there incrementally as they are distributed to the pilot. In this way the changes are documented appropriately and, at the same time, the existing patches can independently follow their standard path through certification.
Software delivered to the pilot is made available to the sites through a yum repository set up ad hoc and updated at the request of the pilot coordinator.
The certification team is involved in the pilot, so they can keep an eye on the content and raise objections to the patches earlier if needed. Of course the developers remain free to follow the standard process, provided that fixes are delivered quickly if needed. In particular the use of ETICS versions at the earliest possible stages is highly encouraged.
Tasks for the sites (installations) are tracked via dedicated Savannah tickets (tasks). They come with a pre-defined estimated effort that can be updated by the site administrator if needed.
The effort spent by the sites on the operation of the pilot instances (e.g. application of subsequent updates, debugging, babysitting) is accounted for via an "Operational task". This task is issued and assigned to the site after the installation is completed, and closed when the service is finally decommissioned from the pilot. Apart from measuring the effort, this task is also used to track major operational issues encountered by the service administrators.
AOB
Date of the next meeting (next week) to be defined off-line after checking the availability of all interested parties. The participation of representatives of the development team in the kick-off with the experiments is highly encouraged.
Provide a report describing the issues faced by CNAF in the installation of glexec on the WNs. INFN-T1 is experiencing a stability problem with GPFS interacting with the WN-on-demand system adopted locally in the resource center. Since they decided to provide virtual WNs for the pilot, the issue consequently affects the deployment of the glexec WN component at the site.
Provide instructions on how to preserve local policies during the upgrade of the Argus server to a newer version, both in an e-mail to the sites and in PATCH:3536. This was done on the 2nd of February; see https://savannah.cern.ch/patch/?3536