glexec/Argus Pilot Service: Home Page


  • Start Date: Tue 24 Nov 2009
  • End Date (tentative): 12 April 2010
  • Description: Pilot Service of glexec/Argus @ FZK, SWITCH, CESNET, SRCE, INFN
  • Coordinator: Antonio Retico
  • Contact e-mail: egee-pilot-argus@cern.ch
  • Status : In Progress
  • Related meetings

Description

Use cases

  • Experiment framework using glexec for production pilot jobs.
  • Test of the grid-wide banning feature by OSCT
  • Gathering of requirements and analysis for monitoring tools

Objective and metrics

Objective:

  1. glexec-Argus chain demonstrated to interact correctly with the LHC Experiments' frameworks for pilot jobs
  2. Maintenance and operations of the Argus service declared supportable by the sites
  3. OSCT able to ban a user on the whole pilot infrastructure without specific intervention of the site administrators
  4. Collection of exhaustive requirements for the implementation of monitoring tools

Planning

Initial plan

| Task | Owner | Start Date | Due Date | Status |
| Set-up repositories and documentation | SA1, SA3, CNAF | 23-Nov-09 | 24-Nov-09 | Done |
| Preliminary installation (ARGUS, WN, CE) | SWITCH | 25-Nov-09 | 27-Nov-09 | Done |
| Core installations (ARGUS, WN, CE) | FZK, SRCE, CESNET, CNAF | 30-Nov-09 | 10-Dec-09 | In progress |

Constraints and milestones

  • kick-off with sites: 25-Nov (11 AM CET)
  • 1st site technically available for Experiments to test (SWITCH): 1-Dec
  • kick-off with experiments: 1-Dec (11 AM CET)
  • All sites technically available for Experiments to test: 15-Jan
  • Indicative start of Alice developments to integrate glexec: 18-Jan
  • Indicative start of CMS developments to integrate glexec: 15-Feb
  • END of activity (proposed): 31-Mar

Technical documentation

Installation Documentation

Yum repo:

Argus service

Worker Node

Computing Element

  • Repository URL : Production repository (I hope)
  • INFO :

Configuration instructions

YAIM modules are available for both the Argus service and glexec:

For finer tuning:

Post configuration tests

To check that Argus is deployed correctly, run some basic tests after installation and configuration: use pap-admin to store, list, update and remove policies, then use pepcli to test authorization requests and responses. Both pap-admin and pepcli are documented in the main Argus twiki.
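A minimal sketch of such a check follows. The `list-policies` subcommand and the `--pepd`, `--certchain`, `--resourceid` and `--actionid` options are assumed from the Argus client documentation and may differ in the version deployed for the pilot. The endpoint URL, resource ID and action ID are hypothetical placeholders, not values used in this pilot.

```shell
#!/bin/sh
# Hypothetical values: replace them with your site's PEPd endpoint and the
# resource/action identifiers used in your policies.
PEPD_URL="${PEPD_URL:-https://argus.example.org:8154/authorize}"
RESOURCE_ID="${RESOURCE_ID:-http://example.org/resource/wn}"
ACTION_ID="${ACTION_ID:-http://example.org/action/execute}"

# Send one authorization request for the given proxy file.
# Without pepcli on the PATH, only print the command that would be run.
run_pepcli() {
    set -- pepcli --pepd "$PEPD_URL" --certchain "$1" \
        --resourceid "$RESOURCE_ID" --actionid "$ACTION_ID"
    if command -v pepcli >/dev/null 2>&1; then
        "$@"
    else
        echo "DRY RUN: $*"
    fi
}

# List the stored policies first (assumed subcommand), then ask for a decision.
if command -v pap-admin >/dev/null 2>&1; then
    pap-admin list-policies
fi
run_pepcli "${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}"
```

On a host without the Argus clients the script only prints the pepcli command line, which makes it easy to review the parameters before running the real request.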

To test the glexec-Argus interaction, run something like the following from a whitelisted account on the Worker Node:

export X509_USER_PROXY=<target_proxy>
export GLEXEC_CLIENT_CERT=${GLEXEC_CLIENT_CERT:-$X509_USER_PROXY}
$GLITE_LOCATION/sbin/glexec /usr/bin/whoami
Then verify that the returned user is the mapped one.
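The verification step can be scripted, as in the sketch below. It only checks that the account reported through glexec differs from the calling account, which assumes glexec is configured to switch identity; it does not know which pool account the site is expected to map to. The `check_mapping` helper and the `/opt/glite` fallback for `GLITE_LOCATION` are illustrative assumptions.

```shell
#!/bin/sh
# Compare the account reported through glexec with the calling account.
# Returns 0 when the identity was switched, 1 otherwise.
check_mapping() {
    calling_user=$1
    mapped_user=$2
    if [ -z "$mapped_user" ]; then
        echo "FAIL: glexec returned no user name"
        return 1
    fi
    if [ "$mapped_user" = "$calling_user" ]; then
        echo "FAIL: still running as $calling_user (no identity switch)"
        return 1
    fi
    echo "OK: $calling_user was mapped to $mapped_user"
    return 0
}

# Run the real check only where glexec and a target proxy are available.
# /opt/glite is an assumed default for GLITE_LOCATION.
glexec_bin="${GLITE_LOCATION:-/opt/glite}/sbin/glexec"
if [ -x "$glexec_bin" ] && [ -n "${X509_USER_PROXY:-}" ]; then
    export GLEXEC_CLIENT_CERT="${GLEXEC_CLIENT_CERT:-$X509_USER_PROXY}"
    check_mapping "$(id -un)" "$("$glexec_bin" /usr/bin/whoami)"
else
    echo "SKIP: glexec or X509_USER_PROXY not available on this host"
fi
```

A site that knows its pool-account naming scheme could tighten the check by comparing the mapped user against the expected prefix instead of only testing for a change.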

Configuration requirements for sites supporting Atlas

  • if a MyProxy server is used to pass the credentials, myproxy-logon has to be installed on the WN (it should be the default in production by now)
  • if a plain proxy is retrieved and VOMS attributes need to be added on the WN, the vomses file has to be reachable from the WN.
  • both the roles atlas:/atlas/Role=production and atlas:/atlas/usatlas/Role=pilot need to be enabled to submit to the queue

General documentation (user guides)

Test documentation


Summary of the load and aging tests done before the certification

  • Load tests:
    • Service host: 1 x 2.33 GHz CPU, 1 GB RAM
    • client recreated - simulates what glexec would do
      • ~60 req/sec, ~160ms (limited by spawning processes)
    • client reused - simulates what CREAM/WMS would do
      • ~240 req/sec, ~120ms
    • client reused, repeat request - simulates pilot jobs
      • ~1000 req/sec, ~37.6ms
  • Aging tests:
    • Test operation over several days with several million requests
    • Memory usage: stable
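
As a rough reading aid, the throughput and latency pairs from the load tests can be combined with Little's law (requests in flight = throughput x latency). The sketch below assumes the quoted latencies are mean per-request response times measured at steady state, which the summary does not state; the `in_flight` function name is illustrative.

```shell
#!/bin/sh
# Little's law: mean requests in flight = throughput (req/s) * latency (s).
# $1 = requests per second, $2 = latency in milliseconds (figures from the summary).
in_flight() {
    awk -v rate="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", rate * ms / 1000 }'
}

echo "client recreated (glexec):         $(in_flight 60 160)"
echo "client reused (CREAM/WMS):         $(in_flight 240 120)"
echo "client reused, repeat (pilot job): $(in_flight 1000 37.6)"
```

Under those assumptions the three scenarios correspond to roughly 10, 29 and 38 requests in flight, respectively.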



Pilot Layout

SWITCH

Argus: one virtual machine with SL5 64 bit, installed from the following repository

http://grid-deployment.web.cern.ch/grid-deployment/glite/cert/3.2/patches/3076/sl5/$basearch/ (an alternative is now available here http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/ARGUS/argus/sl5/x86_64/)

CE: one lcg-CE (diana.switch.ch) with two WNs (SL5 64 bit), all as VMware virtual machines. VOs enabled: dteam, dech, ops, atlas (no atlas software installed). The WNs were installed pointing to the following repository

CE endpoint http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/ARGUS/glite-WN/sl5/x86_64/

FZK-LCG2

Argus: one virtual machine with SL5 64 bit

CE: all CEs at GridKa are usable, but please refer to cream-3-fzk.gridka.de. The queue must be "pps". All software installations are available on the PPS WNs. Enabled VOs: alice, atlas, cms, dteam, lhcb and ops.

WNs: the PPS cluster has been extended to 300 cores

For alice a separate VOBox is available: test-mw-fzk.gridka.de

CNAF

Argus: one virtual machine with SL5 64 bit

CE: one CREAM CE with two virtual WNs (SL5 64 bit). VOs enabled: dteam, infngrid, ops.

CE endpoint: devce.cnaf.infn.it:8443/cream-pbs-cert

Results

Feedback from the experiments

Comments and issues from operations

SWITCH

The instructions for manually installing the Argus-compatible WNs are wrong. It is recommended to use YAIM instead.

FZK / KIT

A reduction of the policy refresh time from 4 hours to 15 minutes was requested. Angela opened the bug:
https://savannah.cern.ch/bugs/index.php?62281

List of issues

| Issue | Reported by | Bug(s) | Status | Open/Closed |
| Wrongly configured GLITE_LOCATION sometimes makes it impossible to discover the glexec executable (affects glexec on the WMS-->GLEXEC-->CREAM chain) | CERN | BUG:62810 | Fixed with patch 3760 (with provider) | open |
| The default policy refresh time of 4 hours seems too long | KIT | BUG:62281 | To be discussed with JSPG | open |
| PEPd should require client-cert authentication support for connecting PEP clients | CNAF-T1 | BUG:60041 | Fixed with patch 3536, in certification | open |

Recommendation for Deployment in production

Final assessment

Tasks and actions:

Actions for SA1 are tracked via the Sa1 Deployment task tracker

Tasks for other participants are tracked here

Open

Assigned to Due date Description State Closed Notify  
GiuseppeMIsurelli 2010-02-19 Provide a report describing the issues CNAF is facing with the installation of glexec on the WNs.

INFN-T1 is experiencing stability problems with GPFS when it interacts with the WN-on-demand system adopted locally at the resource centre.
Because they decided to provide virtual WNs for the pilot, the issue also affects the deployment of the glexec WN component at the site.

2010-02-19 MaartenLitmaath

Assigned to Due date Description State Closed Notify  
GianniPucciani 2010-02-19 Provide a functional specification of the glexec tests being implemented at SRCE
ChadLaJoie 2010-02-03 Provide instructions on how to preserve local policies during the upgrade of the Argus server to a newer version, both in an e-mail to the sites and in PATCH:3536

This was done on the 2nd of February.
This is now done at https://savannah.cern.ch/patch/?3536

2010-03-02 MaartenLitmaath

Closed

Assigned to Due date Description State Closed Notify  
AngelaPoschlad 2010-02-03 Open a bug to request the reduction of the policy refresh time from 4 hours to 15 minutes

3-2-10: Angela opened the bug
https://savannah.cern.ch/bugs/index.php?62281

2010-02-05 MaartenLitmaath
AntonioRetico 2009-12-18 Provide the timeline for an installation of a reasonable scale (>100 WNs) to be available to Atlas in order to test glexec in production

Update 18-Dec (Andrea Ceccanti):
Converging on CNAF offering the first large-scale installation. They are currently working on the installation at the T1 and hope to finish before Christmas, or alternatively by the 6th of January, in order to be ready by the 15th. If the preliminary tests now under way succeed, they are OK to use Argus 1.1 when it is in status "Ready for Certification".

2009-12-18 AntonioRetico
Main.SWITCH, Main.NIKHEF, Main.SA3 2009-12-01 Finalise the YAIM configuration for the Argus-compatible GLEXEC_WN 2009-12-18 AntonioRetico
GianniPucciani 2009-12-04 Enumerate the available deployment scenarios and see whether new developments have to be requested (or re-negotiations with the sites are needed)

Update 26-Nov:
After discussion with JRA1 and SA3 it was proposed to extend the support of the clients to SL4. A new patch has been requested from the developers.
Antonio

Update 1-Dec:
During the last meeting Gianni was put in charge of opening the bug with the change request.

Update 18-Dec (Gianni):
All new developments are now tracked by bugs.

2009-12-08 MaartenLitmaath
GianniPucciani 2009-12-01 provide reference for basic testing for site administrators in the twiki

Update 1-Dec :
info now available in #Post_configuration_tests

2009-12-03 MaartenLitmaath
AngelaPoschlad 2009-12-01 Reply to the proposed timelines for FZK

Angela confirmed that starting on the 30th is fine for her

2009-11-26 MaartenLitmaath

History

30-Mar-2010 : Check point (PPIslandFollowUp2010x03x30):

  • Argus 1.1 server part in "Ready for Rollout"

17-Mar-2010 : Check point (PPIslandFollowUp2010x03x17):

  • Installation at CNAF T1 finished
  • Decision to use the production repository for future operations
  • Testing of the OSCT global banning list approved.
  • Pilot end date shifts to the 16th of April
  • Further developments and tests to be followed within the GDB

16-Feb-2010 : Check point (PPIslandFollowUp2010x02x16):

  • Installation at KIT/FZK scaled-up to 300 cores

2-Feb-2010 : Check point (PPIslandFollowUp2010x02x02):

  • All sites will soon be asked to upgrade to the new version of Argus (PATCH:3536). CNAF-T1 will be the first, the others will follow
  • All sites are requested to apply the workaround in BUG:62206 so that the Argus servers start being published in the information system.
  • Integration work in progress for Alice
  • Integration work confirmed to start in mid-February for CMS

18-Dec-2009 : Check point (PPIslandFollowUp2009x12x18):

  • Testing of Argus version 1.1 in progress at CNAF
  • Installation in progress at all sites. Platform expected to be available by the 15th of Dec

1-Dec-2009 : First installation at SWITCH available for testing

1-Dec-2009 : kick-off with the experiments (PPIslandKickOff2009x12x01)

25-Nov-2009 : kick-off with sites (PPIslandKickOff2009x11x25)

24-Nov-2009 : Pilot Home page created

Topic revision: r35 - 2016-07-05 - MaartenLitmaath
 