Glexec

Motivation

Worker nodes on the grid are highly diverse, which makes it difficult to offer uniform processing resources. A pilot job architecture, in which a pilot probes the environment on the remote worker node before pulling down a payload job, can help. Pilot jobs become smart wrappers that prepare an appropriate environment for job execution and provide logging and monitoring capabilities. PanDA (Production and Distributed Analysis), an ATLAS and OSG workload management system, follows this design.

However, in the simplest (and most efficient) pilot submission approach, identical pilots all carry the same identifying grid proxy. This has two consequences: the site can do end-user accounting only with application-level information (PanDA maintains its own end-user accounting), and end-user jobs run with the identity and privileges of the proxy carried by the pilots, which may be seen as a security risk. To address these issues, we want to enable PanDA to use gLExec, a tool provided by EGEE that runs payload jobs under an end-user's identity. End-user proxies are pre-staged in a credential caching service, MyProxy, and the information the pilots need to access them is stored in the PanDA DB. gLExec then extracts from the user's proxy the proper identity under which to run.

Strategy

  • First, the end-user credentials are downloaded from a MyProxy server.
  • A wrapper bash script is created in which the entire environment is set up again, because the previously existing environment vanishes after the identity switch by gLExec.
  • The wrapper also includes a line that moves the process from the new identity's HOME directory back to the pilot's working directory.
  • Finally, the actual payload command (buildJob or runAthena) is added to the wrapper.
  • After the wrapper is created, the permissions of the directories and files created earlier are modified so that the new identity can read, write and execute them. gLExec is then invoked to switch identity and run the wrapper under the new user.
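
The wrapper-building steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pilot's actual code: the function names build_wrapper, write_wrapper and open_permissions are hypothetical, and so is the exact permission policy (group and other get full access to directories and read/write access to files). The MyProxy download and the gLExec invocation itself are left out.

```python
import os
import re
import shlex
import stat

# Only names that are valid shell identifiers can be re-exported safely.
_VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_wrapper(env, workdir, payload_cmd):
    """Return the text of a bash wrapper script (hypothetical helper).

    env         -- mapping of environment variables to restore
    workdir     -- pilot working directory to return to
    payload_cmd -- payload command line, written to the script verbatim
    """
    lines = ["#!/bin/bash"]
    # Re-create the environment, which is lost after the identity switch.
    for name in sorted(env):
        if _VALID_NAME.fullmatch(name):
            lines.append("export %s=%s" % (name, shlex.quote(env[name])))
    # Move from the new identity's HOME back to the pilot working directory.
    lines.append("cd %s" % shlex.quote(workdir))
    # The payload (for example buildJob or runAthena) runs last.
    lines.append(payload_cmd)
    return "\n".join(lines) + "\n"


def write_wrapper(path, text):
    """Write the wrapper and make it readable and executable by everyone."""
    with open(path, "w") as handle:
        handle.write(text)
    os.chmod(path, 0o755)


def open_permissions(root):
    """Let other accounts use everything under root (assumed policy)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        mode = stat.S_IMODE(os.stat(dirpath).st_mode)
        os.chmod(dirpath, mode | 0o077)  # add rwx for group and other
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symbolic links untouched
            mode = stat.S_IMODE(os.stat(path).st_mode)
            os.chmod(path, mode | 0o066)  # add rw for group and other
```

A caller might pass dict(os.environ), os.getcwd() and the payload command line to build_wrapper, write the result with write_wrapper, call open_permissions on the pilot working directory, and only then hand the script to gLExec.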

Problems and issues

Most problems with gLExec stem from the fact that, after the identity switch, the new user no longer has permission to read, write or execute files anywhere. The files and directories created earlier by the pilot are therefore almost always off limits. Write access to the pilot working directory, which the new user needs in order to generate the output root files and logs, is granted simply by changing permissions (as described in the Strategy section). A few more issues remain:

  • The Python egg cache. With recent Athena releases, some Python libraries are unpacked at run time at some sites. By default, setuptools unpacks them into ~/.python-eggs. Instead, we set the environment variable PYTHON_EGG_CACHE to point to a random directory underneath /tmp/. Note that the libraries can also be unpacked at installation time by using the -Z flag.

  • If a random directory underneath /tmp/ needs to be created, it is not a good idea to use mktemp. Commands of this kind use the value of the environment variable $TMPDIR, and at some sites the local batch system redefines that variable to point to a directory where the new user may not have write permission.

  • With CREAM CE, new directories and files are created by default with umask 0077 instead of the usual 0022. We add the command umask u=rwx,g=rwx,o=rwx (equivalent to umask 0000) to the gLExec wrapper so that files created by the new user are readable by the pilot, which can then stage them out.

  • At some sites, the environment variable LD_PRELOAD points to a non-existent file. This makes gLExec write messages to stderr, which causes the Python interpreter to crash unless gLExec is invoked through one of the commands.get*output() methods. These methods are being deprecated, however. To be able to use subprocess instead, we explicitly prevent LD_PRELOAD from being inserted into the gLExec wrapper.
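
Three of the workarounds listed above can be sketched in Python: creating the egg cache directory without consulting $TMPDIR, leaving LD_PRELOAD out of the environment that is passed on to the wrapper, and running a command through subprocess while capturing stderr. This is only an illustration. The function names are hypothetical, the code assumes it runs under the account that will use the cache directory, and the gLExec command line is not shown, so run_command takes whatever argument list the caller supplies.

```python
import os
import subprocess
import tempfile


def make_egg_cache(base="/tmp"):
    """Create a randomly named directory directly under base.

    Passing dir= explicitly means $TMPDIR is never consulted, so a value
    redefined by the batch system has no effect. The directory is created
    with mode 0700 and is owned by the calling account.
    """
    return tempfile.mkdtemp(prefix="python_eggs_", dir=base)


def wrapper_environment(env, egg_cache):
    """Return a copy of env that is safe to write into the wrapper."""
    # Drop LD_PRELOAD, which may point to a file that does not exist.
    clean = {k: v for k, v in env.items() if k != "LD_PRELOAD"}
    # Redirect setuptools away from the default cache in the home directory.
    clean["PYTHON_EGG_CACHE"] = egg_cache
    return clean


def run_command(cmd, env):
    """Run cmd (a list of arguments) and capture stdout and stderr."""
    proc = subprocess.Popen(
        cmd, env=env, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    out, err = proc.communicate()
    return (
        proc.returncode,
        out.decode("utf-8", "replace"),
        err.decode("utf-8", "replace"),
    )
```

Because stderr is collected through a pipe rather than mixed into the interpreter's own streams, anything the command prints there comes back as an ordinary string that the caller can log or ignore.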

PanDA queues with gLExec enabled

queue                | site                 | comments                            | HammerCloud tests
ANALY_BNL_GLEXEC     | ANALY_BNL_GLEXEC     | pilots submitted by hand            | OK
ANALY_TEST-APF       | ANALY_TEST-APF       | pilots submitted with AutoPyFactory | OK
ANALY_GLASGOW_GLEXEC | ANALY_GLASGOW_GLEXEC | pilots submitted by hand            | --
ANALY_OXFORD_GLEXEC  | ANALY_OXFORD_GLEXEC  | pilots submitted by hand            | --
ANALY_GLEXEC_TRIUMF  | ANALY_GLEXEC_TRIUMF  | pilots submitted by hand            | --
ANALY_CERN_GLEXEC    | ANALY_CERN_GLEXEC    | pilots submitted by hand            | OK


Major updates:
-- JoseCaballero - 14-Mar-2011



Responsible: JoseCaballero

Never reviewed

Topic revision: r5 - 2011-04-20 - JoseCaballero
 