PPS Pilot Follow-up Meeting Minutes Fri 20 Mar 2009

  • Date: Fri 20 Mar 2009
  • Agenda: 54505
  • Description: Pilot of glexec/SCAS: check-point
  • Chair: Antonio Retico

Attendance

  • PPS: Antonio Retico
  • CERN: Gianni Pucciani (SA3)
  • CMS: -
  • Atlas: Jose Caballero, Maxim Potekhine
  • Alice: -
  • LHCb: -
  • Nikhef: (Development) -
  • IN2P3: Pierre Girard
  • FZK: Angela Potschlad

Review of action items (tasks)

SA1/SA3 tasks

Status of the subtasks of TASK:8986 (see them in the PPS tracker).

  • TASK:9073 (service at FZK). Angela: Next week the SL5 WNs will be released, which run on 64-bit machines only. Can glexec work on 64-bit nodes?
Gianni (off-line): There is a PATCH:2847 with status "with provider", meaning that Nikhef is working on it.

  • Certification of SCAS. Gianni: The new glexec PATCH:2892 (released as clone of the previous one) was certified. The version of glexec included is 0.6.6-4. It fixes issues seen at NIKHEF by LHCb while using nfs. Although FZK doesn't use nfs, they are requested to synchronise to the new version.

other tasks


Notes:

Status and results of the pilot service (by VOs and sites)

Atlas

Jose (Atlas, PanDA) reported on his testing at FZK/gridka: his integration tests are successful. Working with Angela (thanks), he identified the three main configuration settings that a site needs to apply in order to support Atlas:

  • the US-Atlas role has to be enabled
  • the myproxy-logon client has to be installed on the WN
  • the vomses file has to be reachable from the WN
These requirements have now been added to the pilot page among the configuration instructions for Atlas.

Maxim explained that the next step for PanDA is to implement a continuous "pinging" of glexec, in order to verify that the service is available and to schedule the pilot jobs correctly.
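As an illustration only, such a "pinging" loop could be sketched as below. This is not PanDA code: the function names (command_probe, poll, is_available) are hypothetical, and the command handed to the probe is a placeholder for whatever check the pilot would actually run through glexec.

```python
import subprocess
import time


def command_probe(cmd, timeout=30):
    """Build a probe that runs cmd and reports True on exit status 0.

    cmd is a placeholder argument list; the real glexec check is an assumption.
    """
    def probe():
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except (OSError, subprocess.TimeoutExpired):
            # Missing binary or hung command both count as "not available".
            return False
        return result.returncode == 0
    return probe


def poll(probe, attempts, interval, sleep=time.sleep):
    """Call probe `attempts` times, waiting `interval` seconds between calls."""
    results = []
    for i in range(attempts):
        results.append(bool(probe()))
        if i < attempts - 1:
            sleep(interval)
    return results


def is_available(results, min_ok=1):
    """Treat the service as available if at least min_ok probes succeeded."""
    return sum(results) >= min_ok
```

A scheduler could call poll() periodically and submit pilot jobs to a site only while is_available() holds for the most recent results.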

In view of the full integration of PanDA, Atlas suggests that it is time to extend the pilot by ramping up to another site.

Pierre (IN2P3) is ready to start his installation. During the coming week (23rd-27th) he will deploy glexec on the WNs, a queue on the CE, and an SCAS server, working on a best-effort basis. As he can redirect users to the correct versions of the MW on the WNs based on the VOMS proxy, he does not need to use the special GlueCE State Status='PPSPilotSCAS'.
TASK:9333 was opened to track the installation.

Jose noted that a frequent issue he faced at all the sites where he tried glexec was that his proxy/role was never authorised in the ACLs of the local storage element and LFC catalogue.

Antonio noted that the correct distribution of the ACLs to be set up on the SEs is a concern of the VO, as it must obey VO policies. Therefore Maxim and Jose are kindly requested to provide a list of the users/roles which they think need to be enabled on the SEs involved in this activity.
The request of course extends to the other VOs as well.

PanDA will start using the installation at IN2P3 during the week after CHEP (30/3-3/4).

LHCb

LHCb was not represented. The following message was sent to the list after the meeting:


-----Original Message-----
From: Roberto Santinelli
Sent: Friday, March 20, 2009 2:56 PM
To: egee-pps-pilot-scas (SCAS Pilot Service)
Cc: lhcb-dirac (LHCb mailing list about DIRAC project development)
Subject: LHCb Vs gLExec summary


Dear gLExec PPS members,

apologies for none of LHCb attending the meeting this morning.

LHCb started 10 days ago its round of tests with gLExec and this mail is aimed to summarize this experience.

1. FZK PPS was not publishing in production the endpoint that accepts submissions with Role=pilot and allows to access the batch farm with gLExec clients installed. DIRAC can be modified to point to a non production system but is not painless and we decided to divert to NIKHEF in production and installing the same version of clients fixing the error code issue [...].


2. First tests there were failing: 
LCMAPS failed, more info should be within the sys log.

The reason as explained by Oscar was:  
"... the NFS root squash affected the reading of the $X509_USER_PROXY when being effective root at that moment (squashing to nobody:nobody). The tests performed at NIKHEF were not done as root, but as the user account of the Sysadmin."

A new version of the LCMAPS SCAS Client (patch #2829) for the Savannah #48093 fixed the problem.

3. 
A second round of tests was however still due to fail. And this because it seems that running anything other than a standard executable (ex. a Dirac script shipped with the pilot and available in the pilot shell) requires to change the ownership recursively of the parent directories of the pilot job workdir *and also* to relax the umask (at NIKHEF set to 077) before the workload owner is sudoed. 
Although we thought this had to be a business of gLExec, this has been implemented in the pilot framework.

4.
Another round of tests the 18th of March, finally managed to succeed and a script in the pilot working area run finally with different credentials.
However the Role=production was mapped to ordinary pool account and this was not good:
uid=47055(lhcb055) gid=2004(lhcb) context=user_u:system_r:initrc_t
This has been indeed recognized as a problem in the generation of the gridmapdir by YAIM for the target users. Fixed by Ronald.

5.
Then we can finally claim that "job 8044 in the Development system is the first to successfully run an LHCb application using glexec at NIKHEF".

Regards,

Roberto and Stuart
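
Item 3 of the message above says that the pilot framework had to open up the parent directories of the job workdir and relax the umask before switching identity. The message does not show how this was done; the sketch below is only an illustration under stated assumptions. It relaxes permission bits instead of changing ownership (chown normally requires privileges), and the names open_up_path and relax_umask, as well as the choice of 022 as the relaxed umask, are hypothetical.

```python
import os
import stat

# Read and traverse permission for group and others (assumed to be sufficient).
READ_EXEC = stat.S_IRGRP | stat.S_IXGRP | stat.S_IROTH | stat.S_IXOTH


def open_up_path(workdir, stop_at, extra_bits=READ_EXEC):
    """Add extra_bits to workdir and to each parent up to, not including, stop_at."""
    workdir = os.path.realpath(workdir)
    stop_at = os.path.realpath(stop_at)
    if os.path.commonpath([workdir, stop_at]) != stop_at:
        raise ValueError("workdir must be inside stop_at")
    changed = []
    current = workdir
    while current != stop_at:
        mode = stat.S_IMODE(os.stat(current).st_mode)
        os.chmod(current, mode | extra_bits)
        changed.append(current)
        current = os.path.dirname(current)
    return changed


def relax_umask(new_mask=0o022):
    """Set the process umask (for example replacing 077) and return the old one."""
    return os.umask(new_mask)
```

In this sketch a pilot would call relax_umask() and open_up_path(workdir, pilot_root) before handing the payload over to the target account, where pilot_root is a hypothetical top-level directory owned by the pilot.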

Status and results of the development (by developers)

Not represented

Open Issues (by VOs, sites, deployment teams)

None

Decision about termination/extension of the pilot

Maxim expects that, from the Atlas perspective, everything should be ready and verified by the first half of April. If no major issues are found affecting the integration of PanDA, what would still need to be verified with the sites is the correct behaviour of SCAS configured in load balancing. With Easter in the middle, it seems prudent to set the new end date for the pilot, and the green light for the release of the glexec/SCAS suite to production, to the 30th of April (to be cross-checked with LHCb).

AOB

Next events

The next check-point meeting will be held on April the 8th at 16:00. It is a national holiday in France, but Pierre will connect. Angela sends her apologies.


Topic revision: r3 - 2009-03-23 - AntonioRetico
 