SCAS Pilot Service: Home Page


  • Start Date: 5 Feb 2009
  • End Date: 15 Jul 2009
  • Description: Pilot Service of glexec/SCAS @ FZK, IN2P3, Nikhef
  • Coordinator: Antonio Retico, Gianni Pucciani
  • Contact e-mail: egee-pps-pilot-scas@cern.ch
  • Status: Closed

Description

Pilot contact list: egee-pps-pilot-scas@cern.ch

Overall Planning

  • Use cases:
SLC5 resources will gradually replace SLC4 resources. It is planned to install new worker nodes arriving at CERN with SLC5.

  • Objective and Metrics:

  • Timelines:

  • Initial Planning:
    pilot-SCAS.gif

  • Updates to planning:
  1. -Feb : the ramp-up to FZK and IN2P3 production is rescheduled and shifted by one week (to start on the 6th of March)

Technical documentation

Installation Documentation

Yum repo:

SCAS service

Worker Node

Computing Element

Configuration instructions

YAIM modules are available for both the SCAS service and gLExec (a hedged example of running them follows the list of configuration files below):

For finer tuning, the relevant configuration files are:

  • scas.conf (default location: /opt/glite/etc/scas.conf)
  • glexec.conf (default location: /opt/glite/etc/glexec.conf)
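
As an illustration, the YAIM invocations below show how the two node types might be configured. This is a hedged sketch: the node-type names glite-SCAS and GLEXEC_wn are assumptions based on the usual gLite naming scheme, so check the names shipped with the pilot YAIM modules before running anything.

    # On the SCAS host (node-type name assumed to be glite-SCAS):
    /opt/glite/yaim/bin/yaim -c -s site-info.def -n glite-SCAS

    # On a worker node that is already configured as a WN
    # (gLExec node-type name assumed to be GLEXEC_wn):
    /opt/glite/yaim/bin/yaim -c -s site-info.def -n WN -n GLEXEC_wn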

In the preliminary phase a standard lcg-CE has to be configured to access the gLExec-enabled WNs.
The resource information provider on that CE should be set up to publish a value of GlueCEStateStatus = "PilotSCAS".
This can be achieved as follows:

  • Define in your site-info.def a variable called, for example, CE_STATE, like this:
    CE_STATE=Special
       
  • In config_gip_ce, modify the glite-info-dynamic-ce script content by running the following command ($plugin is assumed to be the variable holding the original dynamic info-provider command inside config_gip_ce):
      cat << EOF  > ${INSTALL_ROOT}/glite/etc/gip/plugin/glite-info-dynamic-ce
    #!/bin/sh
    
    $plugin | sed -e 's/GlueCEStateStatus: Production/GlueCEStateStatus: ${CE_STATE}/'
    EOF
       
You can change the published GlueCEStateStatus at any time by updating the value of CE_STATE in site-info.def and re-running:
./yaim -r -s site-info.def -n lcg-CE -f config_gip_ce
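
To check that the new status is actually published, one can query the resource BDII of the CE. This is a hedged example: it assumes the CE runs a resource BDII on the standard port 2170, and <your-ce-host> is a placeholder for the real host name.

    ldapsearch -x -h <your-ce-host> -p 2170 -b "mds-vo-name=resource,o=grid" \
        '(objectClass=GlueCE)' GlueCEStateStatus
    # the GlueCEStateStatus attributes should now report the value of CE_STATE
    # instead of "Production"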

General documentation (user guides)

Configuration requirements for sites supporting Atlas

  • if a myproxy server is used to pass the credentials, myproxy-logon has to be installed on the WN (see the sketch after this list)
  • if a plain proxy is retrieved and VOMS attributes need to be added on the WN, the vomses file has to be reachable from the WN
  • both the roles atlas:/atlas/Role=production and atlas:/atlas/usatlas/Role=pilot need to be enabled for submission to the queue
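
The sketch below illustrates the two credential paths listed above as they might look on the WN. It is a hedged example, not part of the pilot configuration: the MyProxy host, user name and proxy file locations are placeholders.

    # (a) credentials passed through a MyProxy server: myproxy-logon must be on the WN
    myproxy-logon -s myproxy.example.org -l <payload_user> -o /tmp/x509up_payload

    # (b) plain proxy retrieved and VOMS attributes added on the WN:
    #     the vomses file (e.g. /opt/glite/etc/vomses) must be reachable
    voms-proxy-init -voms atlas:/atlas/Role=production -noregen \
        -cert /tmp/x509up_payload -key /tmp/x509up_payload -out /tmp/x509up_payload.voms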

Release Management

Method used to distribute the software over the pilot:

  1. The developer (e.g. Oscar at NIKHEF) makes a software change to be distributed in the pilot
  2. Oscar describes the changes (rpm versions, ETICS tags, configuration changes, new features) in an open patch in state "with provider"
  3. Oscar asks the repository manager (Danilo at CNAF) to include the new rpms in the YUM repository, giving him the patch as release notes
  4. Danilo synchronises the repository and forwards Oscar's message to the sites so that they upgrade
  5. When Oscar and/or the pilot is satisfied with the level of the new patch, the patch is moved to certification following the usual process.

It is important to note that the statement "one modification --> one patch --> one release" does not hold in this model, which works more like:

several modifications --> several pseudo-releases to the pilot --> one patch --> one release

Pilot Layout

FZK

The dedicated CE for this pilot is ready; SCAS and the WNs are configured. The CE test-mw-2-fzk.gridka.de can be used by:

  • lhcb:/lhcb/Role=pilot (queue lhcbXXL)
  • atlas:/atlas/Role=production + atlas:/atlas/usatlas/Role=pilot (queue atlasXXL)
  • cms:/cms/Role=production (queue cmsXL)
  • alice:/alice/Role=pilot (queue aliceXL)

The queues are configured to use the production WNs and will be accounted under FZK-LCG2. The jobs are handled as user jobs in the batch-system fair-share of 80:20 (Production:User).

If the experiments want to use different Roles, please let the administrator know. The CE-Status is set to SCASPilot and is currently published in production. SITE_NAME=FZK-LCG2

Nikhef EL-Prod

The pilot setup is deployed on the Nikhef production site. All the WNs are installed with gLExec and configured with multiple SCAS endpoints for fault tolerance. All production CEs are configured to use WNs with gLExec.

The gLExec can be used by:

  • lhcb:/lhcb/Role=pilot
  • atlas:/atlas/Role=production

The gLExec installed on the production system follows Patch #2829 since Friday Feb 27 (afternoon).

IN2P3

Two T1 CEs are configured to use the GLEXEC_WN setup for LHC:

  • cclcgceli03.in2p3.fr
  • cclcgceli04.in2p3.fr
This is available for any user with the role "/lhcb/Role=pilot".

Lancaster

CE is setup to accept ATLAS pilot jobs using these VOMS roles:

  • "/atlas/Role=pilot/Capability=NULL"
  • "/atlas/Role=pilot"
  • "/atlas/usatlas/Role=pilot/Capability=NULL"
  • "/atlas/usatlas/Role=pilot"

Pilot pool accounts are whitelisted for gLExec and the LCMAPS vomspoolaccount plugin handles the user mapping, pending the release of the 64-bit build of the SCAS plugin. The grid-security directory is NFS-mounted on the WNs from the gatekeeper and rpms are downloaded directly from the ETICS repository.

Tasks and actions:

Actions for SA1 are tracked via TASK:8986, available from the PPS task tracker.

Tasks for other participants are tracked here

  • AndyElwell (due 2009-06-11): follow-up comments #63 ... #65 of PATCH:2973
    Update 23 Jun: patch certified. Closed 2009-06-23.

  • JoseCaballero (due 2009-06-05): open a bug for the environment lost after calling glexec
    Update 23 Jun: BUG:51854 was opened by Roberto Santinelli --> close action. Closed 2009-06-23.

  • JoseCaballero (due 2009-06-05): open a bug for the glexec failure in case of an expired VOMS proxy attribute (or re-open BUG:41472)
    Update 23 Jun: BUG:41472 was re-opened; analysis still in progress. Action closed; will be tracked with the bug. Closed 2009-06-23.

  • GianniPucciani (due 2009-04-29): provide a tarball WN out of PATCH:2973
    Update 23 Jun: action obsoleted by comments in BUG:48966 --> closing action.

  • AntonioRetico (due 2009-05-12): follow up glexec for SLC5 with the EMT
    Update 23 Jun: the software now builds correctly; a PATCH will be released soon and will follow the standard path --> closing action. Closed 2009-06-23.

  • MischaSalle (due 2009-04-28): provide PATCH:2973 with a fix for the incompatibilities with CREAM. Closed 2009-05-12.

  • AntonioRetico (due 2009-04-28): register Lancaster as a member of this pilot. Closed 2009-05-11 (notify: AntonioRetico).

  • OscarKoeroo (due 2009-02-10): provide a fix to better detail the error codes of glexec. Note: it would be useful to the interfacing systems if the specification of the interface could be provided at least a week earlier. Closed 2009-03-05 (notify: GianniPucciani).

Results

Feedback from the experiments

Atlas: first findings

  • the installation of gLExec/SCAS at GridKa works fine (at least for a pilot with role=/usatlas/role=pilot)
  • if a MyProxy server is used to pass the credentials, myproxy-logon has to be installed on the WN
  • if a plain proxy is retrieved and VOMS attributes need to be added on the WN, the vomses file has to be reachable from the WN
  • new versions of myproxy-logon seem to retrieve an "old"-style proxy already, so GT_PROXY_MODE=old on the WN is not needed (I am not completely sure about this, but I think so)
    R(Maarten): AFAIK MyProxy will delegate a proxy compatible with what was stored, so GT_PROXY_MODE=old should only be needed for uploading.
  • old problems with proxies delegated to the MyProxy server with VOMS attributes, expired before retrieval, have been fixed; the regenerated proxy on the WN is not refused by gLExec

Open points:

  • Have people realized that when gLExec is invoked, the environment (belonging to the pilot) vanishes?
  • Have people realized that when gLExec is invoked, the current directory is moved to the new user's HOME directory?
  • Have people realized that most probably the new user has no permission to execute programs in the pilot directories? And of course, no permission to write (i.e. the output and log files that the pilot will then be looking for...). A sketch of one way a pilot framework can cope with these points follows this list.
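
The fragment below is one possible way for a pilot framework to handle these three points before calling gLExec. It is a hedged sketch under assumptions: payload.sh, the scratch directory and the proxy locations are hypothetical, and the GLEXEC_CLIENT_CERT / GLEXEC_SOURCE_PROXY variable names should be checked against the gLExec documentation of the deployed version.

    # stage the payload outside the pilot directories, where the mapped user can reach it
    SCRATCH=$(mktemp -d /tmp/payload.XXXXXX)
    cp payload.sh "$SCRATCH/"
    chmod a+rx "$SCRATCH" "$SCRATCH/payload.sh"   # mapped user must enter and execute
    chmod a+w  "$SCRATCH"                         # ... and write its output/log files there

    # proxy of the payload owner, used for the mapping (assumed variable names)
    export GLEXEC_CLIENT_CERT=/tmp/x509up_payload
    export GLEXEC_SOURCE_PROXY=/tmp/x509up_payload

    # gLExec clears the pilot environment and moves to the mapped user's HOME,
    # so re-enter the scratch directory explicitly and collect the output from there
    /opt/glite/sbin/glexec /bin/sh -c "cd $SCRATCH && ./payload.sh > output.log 2>&1"

The wrap/unwrap scripts mentioned later on this page address the environment loss in a more systematic way.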

Some other issues are explained in this paper: gLExec and MyProxy integration in the ATLAS/OSG PanDA Workload Management System

Atlas: further tests at LANCASTER

A second round of testing was run against LANCASTER in June/July, which brought:

  • re-opening of BUG:41237 (glexec refuses doubly limited proxies), fixed with PATCH:2990 for LCMAPS
  • BUG:52417 (segfault when the proxy file is corrupt), fixed by a patch in lcas (glite-security-lcas-1.3.11-2)
  • BUG:41472 (glexec fails if valid VOMS proxy also contains expired extensions); fix certified (PATCH:3084)

Atlas confirmed that they were able to run successfully at LANCASTER on the 6th of July.

LHCb

LHCb progressively ran tests against Nikhef, FZK and Lancaster.

This activity led to the opening of BUG:51854 (environment lost after running glexec).

A summary of the activity at LANCASTER was forwarded at the end of June:

From: Roberto Santinelli
Sent: Wednesday, July 01, 2009 9:20 AM
To: egee-pps-pilot-scas (SCAS Pilot Service)
Cc: Ricardo Graciani Diaz; Philippe Charpentier; Andrei Tsaregorodtsev
Subject: First summary of LHCb tests on gExec

Dear Angela and Peter, thanks again for having managed to have this first round of "slightly more than" trivial tests from LHCb passing (both at GridKA and Lancaster).

My impressions.

I think that a first message that has to pass through is that it is not so immediate and obvious to configure gLExec/SCAS for a given VO on a site; this is true even if the site had already configured it well for another VO. I'm sure that this becomes even less immediate if special customizations are required too. We had to interact several times (at each site) before getting it working.

A second observation that I am tempted to say is that the new piece of m/w from Oscar works. Nonetheless I have not the full evidence of that. I noticed indeed that for both GridKA and Lancaster (while it was not the case at Lyon) there was not really the need to invoke it. Non built-in commands like voms-proxy-info were indeed available in the payload shell irrespectively of this script.

I would now pass the ball to Ricardo for a more exhaustive test through the DIRAC development system in order to check the integration and the effective use case for LHCb. He will require to modify slightly the pilot wrapper in order to incorporate this script as per instruction available at https://www.nikhef.nl/pub/projects/grid/gridwiki/index.php/GLExec_Environment_Wrap_and_Unwrap_scripts

Regards,

R. 

A further statement from LHCb was sent in August


  1. [The pilot] cannot be considered closed successfully (it does not mean that LHCb wants to keep it open) because we had no possibility to test the IN2P3 installation, despite the big effort that Pierre put in place and despite the three bugs about the deployment over non-root WN installations.

  2. LHCb wants to stress that the generic pilot model is something DIRAC heavily relies on and, even if gLExec is not installed on the site, provided that the site supports the FQAN '/lhcb/Role=pilot', DIRAC will feel entitled to run generic pilots there.


Comments and issues from operations

FZK

log of installation of WN

  • Installation at FZK:
    • installing missing packages:
      • glite-security-glexec
      • glite-security-lcas
      • glite-security-lcas-interface
      • glite-security-lcas-plugins-basic
      • glite-security-lcmaps
      • glite-security-lcmaps-plugins-basic
      • glite-security-lcmaps-plugins-scas-client
      • glite-security-lcmaps-plugins-verify-proxy
      • glite-security-saml2-xacml2-c-lib
      • myproxy-VDT1.6.1x86_rhas_4-7
    • creating directories and files:
      • mkdir /opt/glite/etc/lcas
      • mkdir /opt/glite/etc/lcmaps
      • mkdir /var/log/glexec/
      • touch /opt/glite/etc/lcas/ban_users.db
    • changing the configuration if needed
      • /opt/glite/etc/glexec.conf: enable all accounts you want to give permission to use gLExec (user_white_list). If you want to use syslog you have to change log_destination from file to syslog (a hedged example of these settings follows this list).
      • /etc/logrotate.d/glexec_wn_lcaslcmaps_log
      • /opt/glite/etc/lcas/lcas-glexec.db
      • /opt/glite/etc/lcmaps/lcmaps-glexec.db
    • ensure the right permissions
      • chown root.glexec /opt/glite/etc/glexec.conf (do not forget after editing the file!)
      • chown root.glexec /opt/glite/sbin/glexec
      • chmod 640 /opt/glite/etc/glexec.conf
      • chmod 640 /opt/glite/etc/lcmaps/lcmaps-glexec.db
      • chmod 640 /opt/glite/etc/lcas/lcas-glexec.db
      • chmod 640 /opt/glite/etc/lcas/ban_users.db
      • chmod 6555 /opt/glite/sbin/glexec

    • Also needed:
      • /etc/grid-security/certificates
      • /etc/grid-security/vomsdir
      • /opt/glite/etc/vomses

IN2P3

The administrator of IN2P3 submitted BUG:50908 about the issues they encountered installing glexec in a farm where multiple versions of the WNs are used at the same time. Furthermore they opened BUG:50912 about missing logs from LCMAPS when log_destination is syslog.

A dedicated and focused analysis was run by the developers and the site admins at IN2P3, and a series of recommendations for future developments was issued:

* glexec should be made independent of the glite-WN:
  - The glexec-WN meta-rpm should list all its dependencies.
    Released meta-rpm includes Globus; edg-mkgridmap should be removed.
  - A new environment variable GLEXEC_LOCATION indicates where glexec and
    its wrappers are located (in $GLEXEC_LOCATION/sbin):

    http://savannah.cern.ch/bugs/?52837

    If the variable is not set, VOs can fall back on what they do now.

* A dependencies tar ball should be provided for relocatable installations.
  A relocatable installation would need to have two directories put into
  /etc/ld.so.conf.d/glexec for the time being (see below):

    $GLEXEC_LOCATION/glite/lib
    $GLEXEC_LOCATION/globus/lib

* A source rpm should be provided for glexec itself (not its dependencies),
  so that it can be recompiled with a different location for the hard-coded
  configuration file.
  - This is needed for relocatable installations, like at IN2P3.

* A future version of glexec could have a RUNPATH (RPATH is obsolete)
  compiled into the executable, obviating the need for trusted directories
  to be named in /etc/ld.so.conf.d/glexec.

* Another approach would be to link glexec statically.
  - Does not deal with the plugins for LCAS/LCMAPS/SCAS.

* Future versions of glexec/LCAS/LCMAPS should be independent of Globus.
  - But VOMS libraries would still be needed.
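
As a hedged illustration of the GLEXEC_LOCATION proposal above (variable name and fallback behaviour as described in the recommendation; the payload command is hypothetical):

    # prefer a relocated installation, fall back to the standard location otherwise
    GLEXEC_BIN="${GLEXEC_LOCATION:-/opt/glite}/sbin/glexec"
    "$GLEXEC_BIN" /bin/sh -c "$PAYLOAD_COMMAND"

    # for a relocatable installation, /etc/ld.so.conf.d/glexec would list
    #   $GLEXEC_LOCATION/glite/lib
    #   $GLEXEC_LOCATION/globus/lib
    # followed by running ldconfig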

Comments from the developers

  • gLExec tested in PPS by LHCb and Atlas with successful results
    • Deployed on SL4 32 and 64 bit
      • SL5 64 bit is in certification
    • In direct contact with Lyon to discuss packaging issues.
      • A workaround is installed; work is ongoing towards a viable long-term solution that can benefit more parties than just Lyon
    • Created and deployed gLExec wrapper scripts (Perl & Bash) to preserve the environment variables from Pilot Job FW into Pilot Job Payload
  • Continuously improving gLExec and depending libraries
    • Fixed (considered to be) edge cases.
    • Old bugs are fixed, even in LCAS, LCMAPS and the L&L plug-ins
    • The updates are ready for certification
      • They can be used as drop-in replacements, available on all platforms

More information:

List of issues

Issue | Reported by | Bug(s) | Status | Open/Closed
Environment lost after running glexec | LHCb | BUG:51854 | Solution based on wrap/unwrap scripts; certified | closed
glexec fails if valid VOMS proxy also contains expired extensions | Atlas | BUG:41472 | fix certified (PATCH:3084) | closed
glexec refuses doubly limited proxies | Atlas | BUG:40822, BUG:41237 | fix in production | closed
Difficult deployment over non-root WN installations | IN2P3 | BUG:48966, BUG:50908, BUG:52837 | under discussion | open

Final assessment

The original set of packages was released to production and new ones were created to address new findings.

All the main issues found by the VOs were fixed or addressed (see table above).

The deployment issues at IN2P3 were discussed, understood and addressed by three bugs (see table above). The decision was made to follow up this issue in a separate effort between the developers and the site admins.

In conclusion, the pilot activity in PPS can be considered concluded.

History

05-Feb-09: pilot started. Minutes of the kick-off meeting at PPIslandKickOff2009x02x05

06-Feb-09: pilot repository and configuration instructions prepared

12-Feb-09: first PPS installation available at FZK

18-Feb-09: LHCb started testing at FZK

19-Feb-09: Check-point

  • release of new version of glexec implementing the error codes scheduled for the 20th Feb
  • Atlas will use the installation at FZK to try the integration of the new error codes
  • ramp-up of FZK production and IN2P3 re-scheduled to start the 6th of March
  • first results of the certification stress tests are now available
  • NIKHEF has upgraded the production system with the current version

25-Feb-09: Production at FZK updated to the new version of gLExec (0.6.6-1)

20-Mar-09: Production at FZK updated to the new version of gLExec (0.6.6-4)

20-Mar-09: Check-point

  • the new PATCH:2892 was certified (fixes for the NFS issue observed at NIKHEF) and will be distributed to the sites in the pilot
  • PanDA integration tests from Atlas successfully done at FZK. Next step: set-up of a regular test of glexec in PanDA
  • LHCb tried out successfully the submission via Dirac3 at NIKHEF
  • installation at IN2P3 will happen between 23rd and 27th of March. Atlas will integrate PanDA in the following week (end 4th of April). Behaviour of SCAS in load-balancing will be tried out.
  • concern expressed by site admins about compatibility of the current version with SLC5 WNs due to go into production on the 23rd: to be followed up
  • pilot end-date moved to 30th of April, target date for the deployment in production
  • minutes at PPIslandFollowUp2009x03x20

8-Apr-09: Check-point

  • installation at IN2P3 was slowed down by incompatibilities with the WN deployment method applied locally
  • While these issues are being fixed the decision was made to involve another centre in the pilot in order to allow LHCb and Atlas to continue testing
  • The new SCAS error codes are now interpreted by LHCb's Dirac3

9-Apr-09: Tarball installation of WN + glexec now available and provided to IN2P3

24-Apr-09: Check-point

  • installation at IN2P3 in progress
  • Lancaster site from UK joined the pilot

11-May-09: two CEs available at IN2P3

9-Jun-09: scripts to wrap-unwrap the environment under development

29-Jun-09: New version of lcas (glite-security-lcas-1.3.11-2) released to the pilot. It fixes BUG:52417 in lcas (segfault when the proxy file is corrupt)

06-Jul-09 : Atlas confirms that the last patch fixes the issues with VOMS proxies at Lancaster

06-Jul-09 : A technical session was held between IN2P3 and the developers in order to address the issues seen at LYON. The results of the discussion are available at PPIslandFollowUp2009x07x06

06-Jul-09 : The decision was made that the issues at IN2P3 will be followed up with a dedicated task force

15-Jul-09 : pilot closed

7-Aug-09 : further statement received from LHCb

Topic attachments

  • pilot-SCAS.gif (31.2 K, 2009-02-06, AntonioRetico): Initial Planning