Pilot Follow-up Meeting Minutes Tue 16 Feb 2010

  • Date: Tue 16 Feb 2010
  • Agenda: 83746
  • Description: Pilot of glexec/Argus : Check-point
  • Chair: Antonio Retico
  • Home: PilotServiceArgus


  • Operations/SA1: Antonio Retico
  • Certification/SA3: Gianni Pucciani
  • Development/JRA1: Christoph Witzig
  • SRCE: Nikola Garafolic, Emir Imamagic
  • Switch: Alessandro Usai
  • Cesnet: -
  • FZK: Angela Poschlad
  • INFN-CNAF: Giuseppe Misurelli
  • IFIC: Javier Nadal (muted)
  • SAM/Nagios: Emir Imamagic
  • CMS: Claudio Grandi
  • ATLAS: -
  • Alice: Patricia Mendez
  • LHCb: -
  • WLCG (Pilot Jobs Working Group): Maarten Litmaath

NOTE: there were severe issues with the EVO phone conference, so many of the participants were unable to talk. A backup call had to be set up on the Alcatel system at 2:10.

Review of action items (tasks)

SA1/SA3 tasks

Status of the subtasks of TASK:12720 (see them in the PPS tracker).

Other tasks

Assigned to Due date Description State Closed Notify  
GiuseppeMIsurelli 2010-02-19 Provide a report describing the issues being faced by CNAF for the installation of glexec on the WNs.

INFN-T1 is experiencing stability problems with GPFS when it interacts with the WN-on-demand system adopted locally at the resource center.
Since they decided to provide virtual WNs for the pilot, the issue consequently affects the deployment of the glexec WN component at the site.

2010-02-19 MaartenLitmaath

Assigned to Due date Description State Closed Notify  
GianniPucciani 2010-02-19 Provide functional specification of glexec tests being implemented at SRCE
ChadLaJoie 2010-02-03 Provide instructions on how to preserve local policies during the upgrade of the Argus server to a newer version both in an e-mail to the sites and in the PATCH:3536

This was done on the 2nd of February.
This is done now at https://savannah.cern.ch/patch/?3536

2010-03-02 MaartenLitmaath


Status and results of the pilot service (by VOs and sites)

CNAF T1: Giuseppe reported on the status of the installation at the T1, which is still not finished. As explained in the last update of the task, the issue is related to the deployment of glexec on the production WNs. They hope to be able to finish the installation by the end of February, but they are not in a position to commit to a timeline now.

Antonio suggested that a more detailed report on the installation issues would be useful, in order to identify early any deployment issues that may be common to other sites. Giuseppe will provide such a description after consulting with the T1 admins.

SRCE (Emir): role=pilot for OPS is now enabled at SRCE and they have started working on the integration and configuration of the glexec tests in Nagios. In order to enable the submission they needed to get the role defined in the ACLs on the WMS at CERN. They plan to finish this initial integration work by the end of this week (19 Feb) and have the test results for the site SRCE available at https://cs2-egee.srce.hr/nagios/

Antonio asked for a functional description of the tests currently being implemented, for documentation purposes. Gianni will provide it with help from Konstantin.

Giuseppe (answering Patricia) confirmed that the glexec-enabled queues will be served by a CREAM CE (they are already at INFN-CNAF). Maarten reminded the meeting that a bug has recently been reported (http://savannah.cern.ch/bugs/?62810) affecting both WMS-based and direct submission to CREAM: the job ends up with GLITE_LOCATION unset, while on EGEE that variable is currently needed to discover the location of the glexec executable. IT/GT are working on a fix with high priority.

FZK (Angela): The PPS installation based on Argus has been scaled up to 300 cores. After the update of the Argus server to 1.1 there are problems in querying the status of the pepd, currently under investigation. However, this does not impact the service because the new version is installed on a test node.
IFIC (Javier): Could not be heard due to problems in the conference tool
CESNET: absent
INFN-CNAF: covered with T1
SWITCH: nothing to report
Operations: Antonio

There is now one site (FZK/KIT) providing the glexec capability at a more than reasonable scale. Users can in fact access 300 cores supported by Argus and ~5000 slots covered by SCAS. Therefore it is probably time to call in ATLAS again. They will be invited to the next check point, which will be held at 16:00.

CMS (Claudio): the needed configuration in VOMS to enable the role=pilot was done in CRAB, and last week a number of tests were done against sites in the US (OSG). Among the T2s in Europe only DESY was included in the list. There glexec (+ SCAS) was tried in logging mode (Maarten recalled that DESY had previously reported difficulties in working in setuid mode due to local job dirs clean-up policies).
CMS would prefer to run these tests only against T2s but is willing to run against T1s, as this requires only minor changes in the configuration.

Antonio questioned the requirement previously set by CMS that the CNAF T1 be available for the tests. He pointed out that tests at a good production scale can be carried out at FZK/KIT as well. Angela confirmed that the installation at KIT fully supports CMS. The access points are documented on the twiki. Claudio will pass this information to Igor Sfiligoi (glidein developer).

Alice: They have completed the analysis of the impact of the glexec calls on the AliEn architecture and have planned a staged approach for the implementation. Steffen Schreiner (PhD) is currently working on it, but there is no news on the tests yet. They have made an agreement with CNAF to run against that site when ready.

Antonio invited Alice not to treat the availability of CNAF as a constraint and to consider also other sites in the pilot (e.g. KIT) providing the dedicated VOBOX and support to Alice.

Status and results of the development (by developers)

(Christoph) Argus version 1.1 is now certified. It is important to point out that sites upgrading to this version should upgrade the WNs as well, using a version of the LCMAPS plugin which is not certified yet.

The delay in the certification probably has to do with the changed release process, and we hope to have it certified within the next weeks.

Open Issues (by VOs, sites, deployment teams)


Recommendations for release and deployment


Decision about termination/extension of the pilot

Next check point: 3-Mar-2010 at 16.00 CET



