Pilot Kick-Off Meeting Minutes Wed 25 Nov 2009

  • Date: Wed 25 Nov 2009
  • Agenda: 75146
  • Description: Pilot Service: glexec/Argus
  • Chair: Antonio Retico

Attendance

  • SA1: Antonio Retico
  • SRCE: Emir Imamagic, Nikola Garafolic
  • Switch: Alessandro Usai
  • Cesnet: Tomas Kouba
  • FZK: apologies
  • INFN-CNAF: Danilo Dongiovanni, Giuseppe Misurelli, Andrea Ceccanti
  • SA3: Gianni Pucciani
  • SAM/Nagios: John Shade
  • Argus Development: Christoph Witzig, Chad La Joie

Service Introduction (by SA1)

This pilot aims to replicate very closely the kind of testing done for the glexec/SCAS chain. Normally the end users (the LHC experiments) are involved in a pilot from the kick-off onwards, but as ATLAS was not available to participate today, the meeting with the experiments will be scheduled for next week. Today's discussion covers only the issues and planning strictly related to installations. The planning proposed today may have to change in order to be compatible with the agenda of the experiments involved.

Use cases (by Users)

Use cases for the pilot are discussed and approved.

The main use case is the same as in the previous pilot: Experiment framework using glexec for production pilot jobs. It was approved without discussion.

Christoph proposed a second use case: Test of grid-wide banning feature by OSCT. This use case was accepted in the scope of the pilot. OSCT will have to be involved in the next meetings.
Antonio mentioned the possibility of extending this concept to include blacklists of users maintained by the experiments.
Christoph replied that this extension, although technically feasible, was not among the policies originally discussed with the sites and WLCG. Even if it were proved to work, this would not mean that such a method would be applied. It may be a sensitive issue for LCG and some individual sites. In his view this should first be discussed at the policy level. Note: OSCT banning lists were already discussed and approved at JSPG and GDB level.
The decision was made not to include this extension in the scope of the pilot.

Antonio pointed out that the pilot is also a good opportunity to verify the implications for monitoring tools. As Argus is a new service in production, it will have to be adequately monitored. He therefore proposed a side use case for this pilot: the SAM/Nagios team could observe or use the pilot installation in order to gather requirements for the future probes/tests.
John Shade described his ideas for monitoring glexec: the idea is to provide a functional test of glexec, independent of the back-end (Argus or SCAS), to be run at the sites on behalf of the VOs (with their proxy). Asked by Giuseppe which VO would be used, he excluded the use of OPS, which would require the configuration of dedicated roles at the sites without adding real value to the test of the functionality for the experiments.
Giuseppe (CNAF) remarked that whatever tests are done by the SAM/Nagios team should not interfere with the production system. In particular, he is concerned that test jobs with undesired effects may be run on the production system using a VO proxy.
Alessandro (SWITCH) said that he could possibly host test jobs at his site (a few worker nodes on virtual machines with no real production aspirations).
Antonio stressed that the emphasis is not on testing new probes but rather on gathering requirements. However, he reassured the sites that any running of test jobs from SAM/Nagios on the pilot infrastructure will always be discussed beforehand, as the need arises, with the participating sites.


NOTE: After the meeting Christoph pointed out that they have already developed some very simple Nagios scripts, which can also be tested and adapted by the SA1 team.

The final list of use cases to consider is:

  • Experiment framework using glexec for production pilot jobs.
  • Test of grid-wide banning feature by OSCT
  • Gathering of requirements and analysis for monitoring tools

Metrics (by Users)

The corresponding success criteria for the pilot are:

  1. Chain glexec - Argus demonstrated to interact correctly with LHC Experiments' frameworks for pilot jobs
  2. Maintenance and operations of the Argus service declared supportable by the sites
  3. OSCT able to ban a user on the whole pilot infrastructure without specific intervention of the site administrators
  4. Collection of exhaustive requirements for the implementation of monitoring tools

Timeline

The timeline originally proposed by Antonio was discussed.

| Task | Owner | Start Date | Due Date | Status |
| Set-up repositories and documentation | SA1, SA3, CNAF | 23-Nov-09 | 24-Nov-09 | In Progress |

This is still in progress. Danilo (CNAF) raised questions about the content of the repository to be set up at CNAF, in particular about rpms known to be needed for the initial installation but not yet certified.

Andrea (CNAF) reported that the installation of Argus has started at CNAF using the rpms from the patch repository in certification.

Antonio recalled that it is accepted practice to start a pilot only when all the needed packages are certified (with a short excursus on the difference between pilot and experimental services). That said, if the sites agree (as CNAF seems to do), he is not against starting with rpms not yet certified, provided that the rpms are duly documented in a patch.

| Task | Owner | Start Date | Due Date | Status |
| Preliminary installation (ARGUS, WN, CE) | SWITCH | 25-Nov-09 | 27-Nov-09 | In Progress |

Alessandro confirmed this as feasible. In fact the installation is almost finished (the correct version of glexec was installed manually on the WN). He asked for someone to test the functionality of the installation.
The other sites will wait for a report from SWITCH that the installation was successful before starting their own installations. In particular, the YAIM configuration of GLEXEC_WN must be finalized (SWITCH is working with SA3 at NIKHEF on this). (Action)

Action: Gianni will send out details about the tests he ran in certification (please describe them among the instructions for sites in the Pilot twiki page).

Milestone: 1st site technically available for Experiments to test (SWITCH): 1-Dec

| Task | Owner | Start Date | Due Date | Status |
| Core installations (ARGUS, WN, CE) | FZK, SRCE, CESNET, CNAF | 30-Nov-09 | 04-Dec-09 | To start |

Andrea (CNAF): Confirmed

Emir (SRCE): Confirmed

Angela (FZK): shall reply off-line (Action)

Tomas (CESNET): expressed a reservation about the availability of the add-ons for the WN (LCAS/LCMAPS plugins) on the SL4 baseline as well. Upgrading the farm to SL5 is not feasible within this timeline.
This is an open point.
Action: Antonio and Gianni to enumerate the available deployment scenarios and see whether new developments have to be requested (or re-negotiations with the sites are needed).

Milestone: Remaining sites technically available for Experiments to test (SWITCH): 7-Dec

The end of January was approved as a reasonable end date for the pilot (Milestone).

All milestones to be confirmed after discussion with the experiments.

General Agreement on Service Level and Conditions

Antonio proposed to re-use the best practices adopted in previous pilots to manage internal releases.
He gave a quick summary (more detailed information is written in these minutes).

Within the pilot, sites and developers agree on an "accelerated release process" in order to save time.
As soon as a modification to the software is deemed necessary, a patch is created and left in status "With Provider".
The modifications delivered to the pilot are documented there incrementally as they are distributed to the pilot. In this way the changes are documented appropriately and, at the same time, the existing patches can independently follow their standard path through certification.
Software delivered to the pilot is made available to the sites through a yum repository set up ad hoc and updated at the request of the pilot coordinator.
The certification team is involved in the pilot, so they can monitor the content and raise exceptions on the patches earlier if needed. Of course, developers remain free to follow the standard process, provided that fixes are delivered quickly if needed. In particular, the use of ETICS versions at the earliest possible stages is highly encouraged.

Tasks for the sites (installations) are tracked via dedicated Savannah tickets (tasks). They come with a pre-defined estimated effort that can be updated by the site administrator if needed.

The effort spent by the sites on the operation of the pilot instances (e.g. application of subsequent updates, debugging, babysitting) is accounted for via an "Operational task". This task is issued and assigned to the site after the installation is completed, and closed when the service is finally decommissioned from the pilot. Apart from measuring the effort, this task is also used to track major operational issues encountered by the service administrators.

AOB

The date of the next meeting (next week) is to be defined off-line after checking the availability of all interested parties. The participation of representatives of the development team in the kick-off with the experiments is highly encouraged.

Actions

The actions mentioned above are reported on the pilot twiki.


| Assigned to | Due date | Description | State | Closed | Notify |
| GiuseppeMIsurelli | 2010-02-19 | Provide a report describing the issues being faced by CNAF for the installation of glexec on the WNs. INFN-T1 is experiencing a problem with the stability of GPFS when interacting with the WN-on-demand system adopted locally in the resource center. Since they decided to provide virtual WNs for the pilot, the issue consequently affects the deployment of the glexec WN component at the site. | | 2010-02-19 | MaartenLitmaath |

| Assigned to | Due date | Description | State | Closed | Notify |
| GianniPucciani | 2010-02-19 | Provide functional specification of glexec tests being implemented at SRCE | | | |
| ChadLaJoie | 2010-02-03 | Provide instructions on how to preserve local policies during the upgrade of the Argus server to a newer version, both in an e-mail to the sites and in PATCH:3536. This was done on the 2nd of February. This is done now at https://savannah.cern.ch/patch/?3536 | | 2010-03-02 | MaartenLitmaath |


Topic revision: r5 - 2009-11-26 - unknown
 