In the preliminary phase a standard lcg-CE has to be configured to access the GLEXEC-enabled WNs.
The resource info provider on that CE should be set up to publish a value of glueCEStateStatus = "PilotSCAS".
This can be achieved as follows:
Define a variable in your site-info.def, called for example CE_STATE, like this:
CE_STATE=Special
In config_gip_ce, modify the content of the glite-info-dynamic-ce script by running the following command:
Configuration requirements for sites supporting Atlas
if a myproxy server is used to pass the credentials, myproxy-logon has to be installed on the WN
if a plain proxy is retrieved, and adding voms attributes on the WN is needed, the vomses file has to be reachable from the WN.
both the roles atlas:/atlas/Role=production and atlas:/atlas/usatlas/Role=pilot need to be enabled to submit to the queue
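The first two requirements above can be checked on a WN before pilots are sent to it. The following is only a minimal sketch: the function name check_wn_prereqs is hypothetical and not part of any gLite package, and the default vomses location is assumed to be /opt/glite/etc/vomses as in the FZK installation notes further down.

```shell
# Hypothetical pre-flight check for a worker node (not part of gLExec or gLite).
# Usage: check_wn_prereqs [vomses_path]
# Returns 0 when myproxy-logon is found in PATH and the vomses path is readable.
check_wn_prereqs() {
    vomses_path="${1:-/opt/glite/etc/vomses}"   # assumed default location
    rc=0
    if ! command -v myproxy-logon >/dev/null 2>&1; then
        echo "missing: myproxy-logon not found in PATH" >&2
        rc=1
    fi
    if [ ! -r "$vomses_path" ]; then
        echo "missing: $vomses_path is not readable" >&2
        rc=1
    fi
    return "$rc"
}
```

Calling check_wn_prereqs without arguments uses the assumed default path; a site keeping its vomses file elsewhere would pass that path as the first argument.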
Release Management
Method used to distribute the software to the pilot:
The developer (e.g. Oscar at NIKHEF) makes a software change to be distributed in the pilot
Oscar describes the changes (rpm versions, ETICS tags, configuration changes, new features) in an open patch in state "with provider"
Oscar requests the repository manager (Danilo at CNAF) to include the new rpm in the YUM repository, giving him the patch as release notes
Danilo synchronises the repository and forwards Oscar's message to the sites to upgrade
When Oscar and/or the pilot are satisfied with the level of the new patch, it is moved to certification following the usual process.
Note that the statement "one modification --> one patch --> one release" does not hold in this model, which looks more like:
Several modifications --> several pseudo-releases to the Pilot --> One patch --> one release
Pilot Layout
FZK
The dedicated CE for this pilot is ready, SCAS and the WNs are configured.
The CE
test-mw-2-fzk.gridka.de
can be used by
The queues are configured to use the production WNs and will be accounted for FZK-LCG2.
The jobs are handled as user jobs under the batch system fair-share of 80:20 (Production:User).
If the experiments want to use different Roles, please let the administrator know. The CE-Status is set to SCASPilot and is currently published in production. SITE_NAME=FZK-LCG2
Nikhef EL-Prod
The pilot is installed on the Nikhef production site. All the WNs are installed with gLExec and configured with multiple SCAS endpoints for fault tolerance.
All production CEs are configured to use WNs with gLExec.
The gLExec can be used by:
lhcb:/lhcb/Role=pilot
atlas:/atlas/Role=production
The gLExec installed on the production system has been at the level of Patch #2829 since Friday Feb 27 (afternoon).
IN2P3
Two T1 CEs are configured to use the GLEXEC_WN setup for LHC:
cclcgceli03.in2p3.fr
cclcgceli04.in2p3.fr
This is available for any user with the role "/lhcb/Role=pilot".
Lancaster
The CE is set up to accept ATLAS pilot jobs using these VOMS roles:
"/atlas/Role=pilot/Capability=NULL"
"/atlas/Role=pilot"
"/atlas/usatlas/Role=pilot/Capability=NULL"
"/atlas/usatlas/Role=pilot"
Pilot pool accounts are whitelisted for gLExec, and the LCMAPS vomspoolaccount plugin handles the user mapping pending the release of a 64-bit build of the SCAS plugin. The grid-security directory is NFS-mounted on the WNs from the gatekeeper, and rpms are downloaded directly from the ETICS repository.
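Each role in the Lancaster list appears twice, with and without the /Capability=NULL suffix. In VOMS that suffix (like /Role=NULL) is a placeholder for "nothing set", so both spellings denote the same FQAN; listing both is presumably needed because the mapping configuration compares strings. The sketch below shows the equivalence under that assumption. The function names normalize_fqan and fqan_allowed are hypothetical helpers, not LCMAPS or gLExec interfaces.

```shell
# Hypothetical helpers illustrating FQAN equivalence; not an LCMAPS/gLExec API.

# normalize_fqan FQAN: strip the trailing null placeholders so that both
# spellings of the same FQAN compare equal.
normalize_fqan() {
    printf '%s\n' "$1" | sed -e 's|/Capability=NULL$||' -e 's|/Role=NULL$||'
}

# fqan_allowed FQAN ALLOWED...: succeed if FQAN equals any ALLOWED entry
# once both sides are normalised.
fqan_allowed() {
    candidate=$(normalize_fqan "$1")
    shift
    for allowed in "$@"; do
        if [ "$candidate" = "$(normalize_fqan "$allowed")" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, fqan_allowed "/atlas/usatlas/Role=pilot/Capability=NULL" "/atlas/Role=pilot" "/atlas/usatlas/Role=pilot" succeeds, while the same call with "/atlas/Role=production" as first argument fails.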
Tasks and actions:
Actions for SA1 are tracked via the TASK:8986 available from the PPS task tracker
Tasks for other participants are tracked here
to open a bug for the glexec failure in case of an expired VOMS proxy attribute (or to re-open BUG:41472). UPDATE 23 Jun: BUG:41472 was re-opened. Analysis still in progress; action closed, it will be tracked with the bug.
to follow up glexec for SLC5 with EMT. UPDATE 23 Jun: the software now builds correctly; a PATCH will be released soon and it will follow the standard path --> closing action.
to provide a fix to better detail the error codes of glexec. Note: it would be useful to the interfacing systems if the specification of the interface could be provided at least a week earlier.
the installation of gLExec/SCAS at gridka works fine (at least for pilot with role=/usatlas/role=pilot)
if a myproxy server is used to pass the credentials, myproxy-logon has to be installed on the WN
if a plain proxy is retrieved, and adding voms attributes on the WN is needed, the vomses file has to be reachable from the WN.
It seems that new versions of myproxy-logon already retrieve an "old"-style proxy, so GT_PROXY_MODE=old on the WN is not needed (I am not completely sure about this, but I think so). R (Maarten): AFAIK MyProxy will delegate a proxy compatible with what was stored, so GT_PROXY_MODE=old should only be needed for uploading.
Old problems with proxies delegated to the myproxy server with VOMS attributes, expired before retrieval, have been fixed. The regenerated proxy on the WN is not refused by gLExec.
Open points:
Have people realized that when gLExec is invoked, the environment (belonging to the pilot) vanishes?
Have people realized that when gLExec is invoked, the current directory is changed to the new user's HOME directory?
Have people realized that the new user most probably has no permission to execute programs in the pilot directories? And, of course, no permission to write there (i.e. the output and log files that the pilot will then be looking for...)
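The first open point, the lost environment, can be worked around by serialising the pilot environment before the gLExec call and restoring it in the payload. The sketch below only illustrates that idea and is not the wrap/unwrap scripts used in the pilot; wrap_env and unwrap_env are hypothetical names, and how the encoded string travels across the gLExec call is left out.

```shell
# Hypothetical illustration of the wrap/unwrap idea; NOT the scripts deployed in the pilot.

# wrap_env: print all exported variables as a single base64-encoded line.
wrap_env() {
    export -p | base64 | tr -d '\n'
}

# unwrap_env BLOB: re-export every variable stored in BLOB.
# bash prints "declare -x NAME=..." for export -p; it is rewritten to
# "export NAME=..." so the variables do not become local to this function.
# Caveats: the blob is evaluated as shell code, so it must come from a
# trusted source, and a multi-line value containing a line that starts with
# "declare -x " would be altered by the rewrite.
unwrap_env() {
    eval "$(printf '%s' "$1" | base64 -d | sed 's/^declare -x /export /')"
}
```

On the pilot side one would run blob=$(wrap_env) before invoking gLExec, and on the payload side unwrap_env "$blob" as the first step. The working directory and file permission points are not addressed by this sketch.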
A second round of testing was run against LANCASTER in June/July, which brought:
re-opening of BUG:41237 (glexec refuses doubly limited proxies), fixed with PATCH:2990 for LCMAPS
BUG:52417 (segfault when the proxy file is corrupt), fixed by a patch in lcas (glite-security-lcas-1.3.11-2)
BUG:41472 (glexec fails if valid VOMS proxy also contains expired extensions); fix certified (PATCH:3084)
Atlas confirmed that they were able to run successfully at LANCASTER on the 6th of July.
LHCb
LHCb progressively ran tests against Nikhef, FZK and Lancaster.
This activity led to the opening of BUG:51854 (Environment lost after running glexec)
A summary of the activity at LANCASTER was forwarded at the end of June
From: Roberto Santinelli
Sent: Wednesday, July 01, 2009 9:20 AM
To: egee-pps-pilot-scas (SCAS Pilot Service)
Cc: Ricardo Graciani Diaz; Philippe Charpentier; Andrei Tsaregorodtsev
Subject: First summary of LHCb tests on gExec
Dear Angela and Peter, thanks again for having managed to have this first round of "slightly more than" trivial tests from LHCb passing (both at GridKA and Lancaster).
My impressions.
I think that a first message that has to pass through is that it is not so immediate and obvious to configure gLExec/SCAS for a given VO on a site; this is true even if the site had already configured it well for another VO. I'm sure that this becomes even less immediate if special customizations are required too. We had to interact several times (at each site) before getting it working.
A second observation that I am tempted to say is that the new piece of m/w from Oscar works. Nonetheless I have not the full evidence of that. I noticed indeed that for both GridKA and Lancaster (while it was not the case at Lyon) there was not really the need to invoke it. Non built-in commands like voms-proxy-info were indeed available in the payload shell irrespectively of this script.
I would now pass the ball to Ricardo for a more exhaustive test through the DIRAC development system in order to check the integration and the effective use case for LHCb. He will require to modify slightly the pilot wrapper in order to incorporate this script as per instruction available at https://www.nikhef.nl/pub/projects/grid/gridwiki/index.php/GLExec_Environment_Wrap_and_Unwrap_scripts
Regards,
R.
A further statement from LHCb was sent in August
[The pilot] cannot be considered closed successfully (it does not mean that LHCb wants to keep it open) because we had not possibility to test
IN2p3 installation despite the big effort that Pierre put in place and despite the three bugs about the deployment over non-root WN installations
2. LHCb wants to stress that the generic pilot model is something DIRAC heavily relies on and, even if gLExec is not installed on the site, provided that the site supports the FQAN '/lhcb/Role=pilot', DIRAC will feel entitled to run generic pilot there.
Comments and issues from operations
FZK
log of installation of WN
Installation at FZK:
installing missing packages:
glite-security-glexec
glite-security-lcas
glite-security-lcas-interface
glite-security-lcas-plugins-basic
glite-security-lcmaps
glite-security-lcmaps-plugins-basic
glite-security-lcmaps-plugins-scas-client
glite-security-lcmaps-plugins-verify-proxy
glite-security-saml2-xacml2-c-lib
myproxy-VDT1.6.1x86_rhas_4-7
creating directories and files:
mkdir /opt/glite/etc/lcas
mkdir /opt/glite/etc/lcmaps
mkdir /var/log/glexec/
touch /opt/glite/etc/lcas/ban_users.db
changing the configuration if needed
/opt/glite/etc/glexec.conf: Enable all accounts you want to give permissions to use gLExec (user_white_list). If you want to use syslog you have to change log_destination from file to syslog.
/etc/logrotate.d/glexec_wn_lcaslcmaps_log
/opt/glite/etc/lcas/lcas-glexec.db
/opt/glite/etc/lcmaps/lcmaps-glexec.db
ensure the right permissions
chown root.glexec /opt/glite/etc/glexec.conf (do not forget after editing the file!)
chown root.glexec /opt/glite/sbin/glexec
chmod 640 /opt/glite/etc/glexec.conf
chmod 640 /opt/glite/etc/lcmaps/lcmaps-glexec.db
chmod 640 /opt/glite/etc/lcas/lcas-glexec.db
chmod 640 /opt/glite/etc/lcas/ban_users.db
chmod 6555 /opt/glite/sbin/glexec
Also needed:
/etc/grid-security/certificates
/etc/grid-security/vomsdir
/opt/glite/etc/vomses
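Since the chown/chmod steps have to be repeated after every edit of glexec.conf, a small check of the resulting modes can be useful. The following is a minimal sketch that verifies only the file modes listed above (not the root.glexec ownership); check_mode and check_glexec_perms are hypothetical helper names, and the `stat -c` option assumes GNU coreutils.

```shell
# Hypothetical verification helpers; they only read file modes and change nothing.

# check_mode PATH MODE: succeed if PATH has exactly the octal MODE.
# Uses "stat -c", which assumes GNU coreutils.
check_mode() {
    actual=$(stat -c '%a' "$1" 2>/dev/null) || { echo "cannot stat $1" >&2; return 2; }
    if [ "$actual" != "$2" ]; then
        echo "$1: mode $actual, expected $2" >&2
        return 1
    fi
}

# check_glexec_perms [PREFIX]: check the modes from the FZK notes below PREFIX
# (default /opt/glite). Returns non-zero if any file is missing or differs.
check_glexec_perms() {
    prefix="${1:-/opt/glite}"
    rc=0
    check_mode "$prefix/etc/glexec.conf" 640 || rc=1
    check_mode "$prefix/etc/lcmaps/lcmaps-glexec.db" 640 || rc=1
    check_mode "$prefix/etc/lcas/lcas-glexec.db" 640 || rc=1
    check_mode "$prefix/etc/lcas/ban_users.db" 640 || rc=1
    check_mode "$prefix/sbin/glexec" 6555 || rc=1
    return "$rc"
}
```

Running check_glexec_perms after editing the configuration prints one line per deviating file on standard error and returns a non-zero status if anything is off.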
IN2P3
The administrator of IN2P3 submitted BUG:50908 about the issues they encountered installing glexec in a farm where multiple versions of the WNs are used at the same time. Furthermore, they opened BUG:50912 about the missing log from LCMAPS when log_destination is syslog.
A dedicated and focused analysis was run by the developers and the site admins at IN2P3, and a series of recommendations for future developments was issued:
* glexec should be made independent of the glite-WN:
- The glexec-WN meta-rpm should list all its dependencies.
Released meta-rpm includes Globus; edg-mkgridmap should be removed.
- A new environment variable GLEXEC_LOCATION indicates where glexec and
its wrappers are located (in $GLEXEC_LOCATION/sbin):
http://savannah.cern.ch/bugs/?52837
If the variable is not set, VOs can fall back on what they do now.
* A dependencies tar ball should be provided for relocatable installations.
A relocatable installation would need to have two directories put into
/etc/ld.so.conf.d/glexec for the time being (see below):
$GLEXEC_LOCATION/glite/lib
$GLEXEC_LOCATION/globus/lib
* A source rpm should be provided for glexec itself (not its dependencies),
so that it can be recompiled with a different location for the hard-coded
configuration file.
- This is needed for relocatable installations, like at IN2P3.
* A future version of glexec could have a RUNPATH (RPATH is obsolete)
compiled into the executable, obviating the need for trusted directories
to be named in /etc/ld.so.conf.d/glexec.
* Another approach would be to link glexec statically.
- Does not deal with the plugins for LCAS/LCMAPS/SCAS.
* Future versions of glexec/LCAS/LCMAPS should be independent of Globus.
- But VOMS libraries would still be needed.
Comments from the developers
gLExec tested in PPS by LHCb and Atlas with successful results
Deployed on SL4 32 and 64 bit
SL5 64 bit is in certification
In direct contact with Lyon to discuss packaging issues.
A workaround is installed, working towards a viable long-term solution that can benefit more parties than just Lyon
Created and deployed gLExec wrapper scripts (Perl & Bash) to preserve the environment variables from Pilot Job FW into Pilot Job Payload
Continuously improving gLExec and depending libraries
Fixed (considered to be) edge cases.
Old bugs are fixed, even in LCAS, LCMAPS and L&L plug-ins
The updates are ready for certification
They can be used as drop-in replacements, available on all platforms
User and SysAdmin How-To, install opportunities, Man-page, Exit codes explanation, links to the Service Reference Card, Batch system interoperability tests and more (to come)
List of issues
| Issue | Reported by | Bug(s) | Status | Open/Closed |
| Environment lost after running glexec | LHCb | BUG:51854 | Solution based on wrap/unwrap scripts. Certified | closed |
| glexec fails if valid VOMS proxy also contains expired extensions | Atlas | BUG:41472 | fix certified (PATCH:3084) | closed |
| glexec refuses doubly limited proxies | Atlas | BUG:40822 41237 | fix in production | closed |
| Difficult deployment over non-root WN installations | | | | |
The original set of packages was released to production and new ones were created to address new findings.
All the main issues found by the VOs were fixed or addressed (see table above).
The deployment issues at IN2P3 were discussed, understood and addressed by three bugs (see table above). The decision was made to follow up this issue in a separate effort between the developers and the site admins.
In conclusion, the pilot activity in PPS can be closed.
History
05-Feb-09: pilot started. Minutes of the kick-off meeting at PPIslandKickOff2009x02x05
06-Feb-09: pilot repository and configuration instructions prepared
12-Feb-09: first PPS installation available at FZK
18-Feb-09: LHCb started testing at FZK
19-Feb-09: Check-point
release of new version of glexec implementing the error codes scheduled for the 20th Feb
Atlas will use the installation at FZK to try the integration of the new error codes
ramp-up of FZK production and IN2P3 re-scheduled to start the 6th of March
first results of the certification stress tests are now available
NiKHEF has upgraded the production system with the current version
25-Feb-09: Production at FZK updated to the new version of gLExec (0.6.6-1)
20-Mar-09: Production at FZK updated to the new version of gLExec (0.6.6-4)
20-Mar-09: Check-point
new PATCH:2892 certified (fixes for nfs issue observed at NIKHEF) will be distributed to sites in the pilot
PanDA integration tests from Atlas successfully done at FZK. Next step: set-up of a regular test of glexec in PanDA
LHCb tried out successfully the submission via Dirac3 at NIKHEF
installation at IN2P3 will happen between 23rd and 27th of March. Atlas will integrate PanDA in the following week (end 4th of April). Behaviour of SCAS in load-balancing will be tried out.
concern expressed by site admins about the compatibility of the current version with SLC5 WNs, due in production on the 23rd: to be followed up
pilot end-date moved to 30th of April, target date for the deployment in production
11-May-09: two CEs available at IN2P3
9-Jun-09: scripts to wrap-unwrap the environment under development
29-Jun-09: New version of lcas (glite-security-lcas-1.3.11-2) released to the pilot. It fixes BUG:52417 in lcas (segfault when the proxy file is corrupt)
06-Jul-09 : Atlas confirms that the last patch fixes the issues with VOMS proxies at Lancaster
06-Jul-09 : A technical session was held between IN2P3 and the developers in order to address the issues seen at LYON. The results of the discussion are available at PPIslandFollowUp2009x07x06
06-Jul-09 : The decision was made that the issues at IN2P3 will be followed up with a dedicated task force
15-Jul-09 : pilot closed
7-Aug-09 : further statement received from LHCb