Deployment of gLExec on the Worker Node

Applicability

The information on this page is relevant for WLCG sites running CREAM or OSG-CE services (the ARC-CE is currently not concerned). Installation and configuration advice is given for EMI middleware. USATLAS and USCMS sites are asked to coordinate with their respective T1 sites instead. It is understood that for various reasons some sites may be unable to comply with the guidelines detailed below; such sites may need to be excluded from use by Multi User Pilot Jobs in the future.

Background

Each of the experiments has a framework that supports the submission of so-called pilot jobs:

  • ALICE - AliEn
  • ATLAS - PanDA
  • CMS - GlideinWMS
  • LHCb - DIRAC
On the Worker Node (WN) a pilot job deploys a pilot agent that contacts the task queue managed by the VO's framework, to obtain the highest-priority task (a.k.a. payload) that is compatible with the WN environment.

A pilot job can either be "single-user" (a.k.a. "private"), i.e. run only payloads submitted with the same credentials as its own, or "multi-user", in which case it may run payloads submitted by any user authorized by the VO.

Multi-user pilot jobs (MUPJs) can only be submitted by a small group of people in each experiment, essentially production managers, using a specific VOMS role (Role=pilot).

When a payload has finished and the pilot job slot has sufficient CPU and wallclock time left, another task might be downloaded, possibly submitted by a different user, etc.

Various sites were uncomfortable with the idea that the users who submitted such tasks would not be identifiable by the site, should a forensic investigation be needed after an incident.

This led to the idea of introducing a mechanism allowing sites to exercise authorization and obtain traceability within MUPJs. The proposed implementation depends on a command that allows the pilot job to imitate the CE to a certain extent.

That command is "glexec", which is similar to Apache's "suexec". Glexec has two running modes. In "log-only" mode it logs the authorization decision and runs the given payload under the identity of the invoking pilot agent. In "identity-changing" mode it logs the authorization decision, maps the payload proxy to a local account and runs the payload under that account. In the latter case glexec needs to be setuid root.
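As an illustration, the sketch below shows roughly how a pilot hands a payload to glexec; the binary path and file names are assumptions, not prescriptions.

      # Hedged sketch of a pilot invoking glexec (paths are illustrative).
      # The payload owner's proxy has been delivered to the pilot beforehand.
      export GLEXEC_CLIENT_CERT=/tmp/payload_proxy.pem   # proxy of the payload submitter
      /usr/sbin/glexec /path/to/payload_wrapper.sh       # run the payload through glexec

      # In log-only mode the payload keeps running under the pilot account;
      # in identity-changing mode it runs under the account to which the
      # payload proxy is mapped.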

More details at: https://wlcg-tf.hep.ac.uk/wiki/Multi_User_Pilot_Jobs

Time lines

During the WLCG Management Board meeting of Feb 8 2011 it was decided that CERN and the T1 sites should have gLExec on the WN working by the end of March 2011, while for T2 sites the aim was the end of June 2011.

Subsequent discussions then led to a suspension of the deployment until the matter of Multi User Pilot Jobs had been revisited in the WLCG Technical Evolution Groups on Security and Workload Management. The outcome is that the use of gLExec currently is the only viable method for user separation in Multi User Pilot Jobs and that its deployment should hence continue. This was confirmed in the Management Board meeting of May 14, 2013, as summarized in the minutes. As of March 2016 the deployment efforts have been concluded, see here.

The deployment status is tracked on a separate page.

The status of the experiment frameworks:

  • ALICE - adaptations of AliEn under study.
  • ATLAS - implementing use of gLExec by the PanDA pilot.
  • CMS - transparent usage available through GlideinWMS.
  • LHCb - usage configurable in DIRAC.
Further details at:

How to implement gLExec on the WN

This section pertains to EMI / UMD middleware. USATLAS and USCMS sites should consult with their respective T1 sites.

By default sites are advised to take EMI products from the EGI UMD, where products appear after passing a Staged Rollout phase (to which any EGI site can contribute):

For urgent bug fixes or features one can use the EMI repositories directly:

Please consult the current documentation for the various components; some of the links mentioned below may be out of date.

Check list

On the CE the pilot role needs to be configured for each supported experiment and for the "ops" VO.

The roles should be mapped to separate sets of accounts that will be put into the gLExec "whitelist" (accounts that are allowed to run the "glexec" command). Examples are shown here:
https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#Notes_on_using_SCAS_and_ARGUS
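For illustration only, dedicated pilot pool accounts could be declared along the following lines; the account names, UIDs and GIDs are made up, and the exact users.conf field layout should be verified against the YAIM documentation.

      # Hypothetical users.conf entries for dedicated pilot pool accounts
      # (assumed format UID:LOGIN:GID(s):GROUP(s):VO:FLAG: -- verify against the YAIM guide)
      60101:pltatlas001:6011,6010:pltatlas,atlas:atlas:pilot:
      60102:pltatlas002:6011,6010:pltatlas,atlas:atlas:pilot:
      60201:pltops001:6021,6020:pltops,ops:ops:pilot:

The resulting accounts are the ones to put into the gLExec "user_white_list"; the exact whitelist syntax (individual accounts, pool-account prefixes with a leading dot, or groups) is described in the gLExec documentation, and an example appears in the glexec.conf further below.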

An Argus server should be set up

    • Take Argus from EMI or the EGI UMD as discussed above.
    • https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#ARGUS
    • Note: the Argus server may need to share the "gridmapdir" with the CEs to guarantee that a particular proxy will always be mapped to the same account. Beware that the Argus server needs to have write access to that directory. An alternative would be for Argus and the CEs to use non-overlapping sets of accounts. CREAM now supports the use of Argus also on the CE level (recommended), which allows the gridmapdir to be present only on the Argus server.
    • Note: even if the CE restricts the groups and roles that are supported, the Argus WN policy must accept ALL groups and roles of the affected VOs!
      Example "groups.conf" for some WLCG VO:
          "/vo/ROLE=lcgadmin":::sgm:
          "/vo/ROLE=production":::prd:
          "/vo/ROLE=pilot":::pilot:
          "/vo/*"::::
          "/vo"::::
                 
      YAIM does not (yet) configure the corresponding policies in the PAP, but a reasonable policy can easily be derived from the groupmapfile by running a script created by Antonio Delgado of CIEMAT:
      http://wwwae.ciemat.es/~delgadop/from-groupmap-to-policy.sh
      The output can be saved in a file that can be fed to the PAP as follows:
          sh  from-groupmap-to-policy.sh  >  my-policy.spl
          pap-admin  add-policies-from-file  my-policy.spl
                 
    • Note: while the number of explicitly recognized groups and roles could be reduced compared to the configuration of the CE (that is one of the benefits of using pilot jobs), imitating the CE configuration is easy and usually sufficient.
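    • For illustration, a WN policy of the kind produced by the script above might look roughly like the sketch below, written in the Argus Simplified Policy Language. The resource and action identifiers, the VO name and the FQANs are placeholders; the policy actually loaded into the PAP should be derived from your own groupmapfile.
          resource "http://example.org/wn" {
              obligation "http://glite.org/xacml/obligation/local-environment-map" {}

              action "http://glite.org/xacml/action/execute" {
                  rule permit { pfqan = "/vo/Role=pilot" }
                  rule permit { pfqan = "/vo/Role=production" }
                  rule permit { vo = "vo" }
              }
          }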

On the EMI-WN the "emi-glexec_wn" meta package should be installed.

Configure it (preferably in "setuid" mode) according to the YAIM documentation:
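A minimal sketch of the installation and configuration steps is given below; the YAIM node type (GLEXEC_wn) and variable name (GLEXEC_WN_OPMODE) are quoted from memory and should be verified against the current YAIM documentation.

      # On an EMI worker node, as root (illustrative sketch -- verify against the YAIM docs)
      yum install emi-glexec_wn

      # In site-info.def (assumed variable name, check before use):
      #   GLEXEC_WN_OPMODE=setuid        # or "log" for the log-only mode

      /opt/glite/yaim/bin/yaim -c -s /root/site-info.def -n WN -n GLEXEC_wn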

Option to install gLExec without Argus

    • In this method, central banning can be accomplished as described here, with a ban file that may be kept in one location, e.g. on NFS (an illustrative sketch of such a ban file is given at the end of this list)
    • If log-only mode is used, a gridmapdir is not needed and the glexec executable does not need to have the setuid bit set
    • If setuid mode is used (recommended for traceability), the gridmapdir ought to be shared between the WN and the CE(s), to ensure consistent mappings and thereby prevent users from accessing files (including proxies) of other users
    • An example glexec.conf file for the log-only configuration without Argus:
      [glexec]
      log_level                    = 3
      user_white_list              = .somegroup
      linger                       = yes
      user_identity_switch_by      = glexec
      use_lcas                     = no
      lcmaps_debug_level           = 3
      lcmaps_get_account_policy    = glexec_get_account
      log_destination              = syslog
            
    • An example lcmaps-glexec.db file for the log-only configuration without Argus:
      path = /usr/lib64/lcmaps
      verify_proxy = "lcmaps_verify_proxy.mod" 
                     " -certdir /etc/grid-security/certificates/"
                     " --allow-limited-proxy"
      
      good = "lcmaps_dummy_good.mod" 
             " --dummy-username nobody" 
             " --dummy-group nobody" 
             " --dummy-sec-group nobody" 
      
      ban_dn = "lcmaps_ban_dn.mod"
             "-banmapfile /somewhere/ban_users.db"
      
      ban_fqan = "lcmaps_ban_fqan.mod"
             "-banmapfile /somewhere/ban_users.db"
      
      glexec_get_account: 
      verify_proxy -> ban_dn
      ban_dn -> ban_fqan
      ban_fqan -> good
      
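    • An illustrative sketch of the ban file referenced by the ban_dn and ban_fqan modules above; the exact syntax expected by the "-banmapfile" option is an assumption here and should be checked against the LCMAPS ban plugin documentation:
      # /somewhere/ban_users.db -- assumed format: one banned DN or FQAN per line
      "/DC=ch/DC=example/OU=Users/CN=Compromised Credential"
      "/somevo/Role=badactor"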

Known issues

  • glexec < 0.9.9 aborts when the environment contains MALLOC_ variables
    • fix released in 0.9.11

Monitoring of gLExec tests

For CREAM services registered in the GOCDB (EGI and direct partners), the "ops" VO can submit hourly test jobs with "/ops/Role=pilot" as the primary attribute in the proxy. These jobs execute a simple test of the "glexec" command on the WN, using the job's own proxy as the "payload" proxy: no identity change shall occur, but all other aspects of the setup are thus tested. The tests used to be submitted centrally, but that functionality was decommissioned in early April 2012; see the notes below.
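In essence, the WN part of such a test boils down to the hedged sketch below; the real probe is more elaborate, and the glexec path shown is an assumption.

      # Rough sketch of the glexec test run inside a job whose proxy has
      # /ops/Role=pilot as its primary FQAN (illustrative only).
      export GLEXEC_CLIENT_CERT=$X509_USER_PROXY   # reuse the job's own proxy as the "payload" proxy
      /usr/sbin/glexec /usr/bin/id                 # authorization and mapping are exercised; no identity change expected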

Since Update 10 of the SAM-Nagios software these tests can be enabled in the ROC/NGI profile, so that sites can monitor the results along with those of the other tests:

Note: such automatic glexec tests only cover CE services that have also been declared with a "gLExec" type in the GOCDB.
Note: please declare your CE services that should receive the glexec tests!

To see the OPS gLExec test results for all NGI/ROC instances:

LHCb also run such tests:

CMS have developed a more elaborate test for which the results are published here:

ATLAS test results are published here:

ALICE test results are published here:

More information

-- MaartenLitmaath - 13-Apr-2011
