Generic PanDA Pilot

Introduction

This document aims to guide the development of "experiment"-specific plug-ins for the PanDA Pilot. Examples of "experiment" plug-ins are the ATLASExperiment and ATLASSiteInformation classes containing all the specifics related to the ATLAS Experiment, i.e. all the code that is unique and relevant only for ATLAS. This will include methods for how the payload should be setup, how metadata should be handled and which format it should have, how "queuedata" (geometrical DB information for a specific grid site) should be downloaded, etc. ATLASExperiment inherits from Experiment, and ATLASSiteInformation from SiteInformation.

An experiment that is adopting PanDA and wants to use the PanDA Pilot, should only have to implement the virtual methods declared in the Experiment and SiteInformation classes. This document will give guidelines on how to do this.

The PanDA Pilot

A generic approach in grid computing is to use pilot jobs. Instead of submitting jobs directly to the grid gatekeepers, pilot factories are used to submit special lightweight jobs referred to here as pilot wrappers that are executed on the worker nodes. In the case of ATLAS, the pilot wrappers perform initial checks, download the main pilot code and launch it. The pilot is responsible for downloading the actual payload and any input files from the Storage Element (SE), executing the payload, uploading the output to the SE, and sending the final job status to the PanDA server.

A list of publications and presentations regarding the PanDA Pilot can be found on the PanDA wiki.

General Structure of the PanDA Pilot

The PanDA Pilot is written in the Python language and has a modular structure. The main code is divided in two modules, pilot and monitor, to facilitate the introduction of glExec [ref] which offers user identity switching. While it would have been easier to do any user identity switching before the pilot is launched by the wrapper, it was not an alternative for ATLAS because of the highly useful multi-job mechanism that allows a single pilot to download and run multiple payloads from different users in sequential order until it runs out of a predetermined time. The pilot contains the multi-job loop, and starts the monitor after having successfully downloaded a job (defined as a payload and all the information needed to execute it). The monitor, in turn, forks and monitors a subprocess module which is in charge of serving the payload. Stage-in and stage-out are handled by Site Mover modules. A number of utility type modules exist that serve the main modules, including server communication, TCP messaging, error code handling and interpretation, job definition, as well as other utilities. More details on the main pilot modules and special mechanisms are given in the sections below.

Plug-in Mechanism

A new PanDA Pilot user should in principle only have to implement certain methods (marked as "compulsory" in the code, and highlighted with blue numbered circles in this document) in the plug-in *Experiment and *SiteInformation classes. The plug-in classes contain methods that are experiment-specific but sorted into two different classes. The *Experiment classes contain methods related to payload setup, how the subprocess (see below) should be launched, how metadata should be handled, which files should be removed from the payload work directory before the job log file is created, etc. The *SiteInformation classes contain methods for queuedata handling (actual site information from a DB), how it should be downloaded, from where, and how to verify its integrity, as well as how to manipulate it (e.g. queuedata parameter overwrites which can be useful for special testing), etc.

Default methods are provided in the base Experiment and SiteInformation classes, but can be overridden in the derived classes.

Example: How to implement *Experiment and *SiteInformation classes.

An experiment called TEST wants to use the PanDA Pilot. Check-out the latest PanDA Pilot version from the SVN repository (link will be provided upon request). Make a copy of "OtherExperiment.py", and call it "TESTExperiment.py". Implement the compulsory methods listed in the prototype class. Do the same for "OtherSiteInformation.py". Note the name of the experiment at the beginning of the modules, set it to a relevant name (here called TEST). This name will be specified when launching the pilot, so that the pilot will know which experiment to serve (syntax: python pilot.py [options] -F <experiment>).

When the pilot wrapper is launching the PanDA Pilot, it gives it a set of options that defines the initial state of the pilot. The pilot option "-F " is used to select a certain experiment. Internally, the value is used to generate the proper Experiment and SiteInformation objects via factories (ExperimentFactory and SiteInformationFactory) based on the Factory Design Pattern. In order to be able to generate a proper class instance, the newly created classes have to be added to the ExperimentFactory and SiteInformationFactory classes. Simply import the new class at the top of the factory classes, and add a test case in the main function (see e.g. how it is done for class "Other"). More details about the *Experiment and *SiteInformation classes can be found in the sections below.

General Workflow of the PanDA Pilot

This section outlines the workflow of the PanDA Pilot and provides information about when methods, that are compulsory to implement, are called. Note: several compulsory methods are already implemented in the base classes, which might only be relevant for ATLAS. If any detail needs to be changed in these methods, simply make a copy of the default method in the base class and add it to the derived class and make the necessary changes there. Major steps, indicating the need for experiment-specific code modifications, are marked with numbers.

Pilot_workflow_glExec.jpg

Figure 1. PanDA Pilot workflow.

1.png Upon launch, the PanDA Pilot starts with downloading queuedata, unless it is already downloaded by e.g. the wrapper (or by hand, which can be useful during interactive testing). Queuedata handling is provided by the SiteInformation getQueuedata() method (via handleQueuedata() in pUtil.py, called from pilot.py). Queuedata can be post processed, or overwritten, by the SiteInformation postProcessQueuedata(), which can be practical during testing. The default getQueuedata() method uses a secure curl command to download the queuedata from the PanDA server. In case experiment-specific changes are needed, these methods can be overridden in the relevant subclass (*SiteInformation).

For reference, the current list of schedconfig fields used by the pilot can be seen by this command: curl -sS "http://pandaserver.cern.ch:25085/cache/schedconfig/CERN-PROD-all-prod-CEs.pilot.json". The full queuedata dump for the queue in the curl example above can be seen here. The schedconfig fields themselves, are explained in another wiki page.

2.png The pilot then performs a set of special checks and displays some information that can be unique to an experiment. For ATLAS, the pilot tests the import of the LFC module, verifies CVMFS via an external script and displays the CVMFS ChangeLog, as well as system architecture information. Any test failure can be used to abort the pilot. For implementation details, see the specialChecks() (method to be renamed) method in the Experiment and ATLASExperiment classes.

A signal handler is declared, that will handle and act on incoming kill signals. The following signals are currently intercepted: SIGTERM, SIGQUIT, SIGSEGV, SIGXCPU, SIGUSR1, and SIGBUS. If such a signal is received, the pilot will immediately inform the PanDA server and then proceed to kill the remaining subprocesses (if any). It is important that the SIGKILL signal does not arrive too soon after the initial kill signal, since it can take up to a few minutes for the pilot to kill all subprocesses and perform the cleanup. The time difference between SIGKILL and other kill signals can be fine tuned in case the kill signal originates from the batch system. Approximately four minutes of time difference is an absolute minimum time limit, but a longer time is recommended.

If the site, where the pilot is running, has opted for job recovery (as set in schedconfig.retry), an algorithm will be run that looks for output files deliberately left behind by previous pilots that encountered problems during stage-out. The stage-out of these files will be re-attempted for a limited amount of tries. On sites where the pilot is running within a closed sandbox, the batch system will purge the sandbox at the end of the job. In this case, a special recovery directory can be defined using schedconfig.recoverdir.

In case a hard kill signal arrives before the pilot has finished properly, the corresponding work directory can be left stranded on the worker node and will not be cleaned up by the batch system unless the job was run inside a sandbox. To this end, the pilot is equipped with a cleanup algorithm that searches for and purges any work directories that have not been touched for a very long time. The cleanup algorithm will perform a deeper analysis of the found directory if it has not been modified for two hours, but will only delete the directory if it has not been touched for seven days. The primary cleanup mechanism on such a site, however, is performed by special cleanup cron jobs. The pilot based cleanup is an extra layer of protection in case the cleanup cron jobs are not catching all cases, as has been seen in the past.

The Multi-Job Loop (pilot module)

The main loop of the PanDA Pilot (pilot.py) is the so-called multi-job loop. This refers to the capability of running multiple jobs, from different users, sequentially until the pilot runs out of time. This is a site configurable feature controlled via the schedconfig.timefloor field. If unset, or set to zero, the pilot will only execute one single payload. If it is set to a value greater than zero, the multi-job loop will download and execute payloads until it runs out of time. The multi-job feature is of particular use when there are many short jobs are in the system (the pilot launch overhead is bypassed).

Example: schedconfig.timefloor = 60 (value is in minutes). A job (setup, stage-in, payload execution, stage-out) from user X runs for 25 minutes. At the end of that job, the pilot sees that 35 minutes remain to execute jobs, so it downloads and executes another job. The second job runs for 120 minutes. At the end of the second job, the pilot sees that it has run out of time to launch new jobs, and the multi-job loop (and the pilot itself) finishes.

The multi-job loop begins with the creation of the current work directory. A job is downloaded from the PanDA server (job dispatcher). As an alternative, the job can also be downloaded in advance or be created by hand and be placed in a file, which will be found and used by the pilot. If a job cannot be found, in a server download, the pilot waits one minute and then tries again. If a job still cannot be found, the pilot aborts. When asking for a job, the pilot sends worker node details (e.g. CPU and memory information) along with other parameters (e.g. pilot user DN in case he or she only wants a job corresponding to the matching DN) to the job dispatcher which decides if it should return a job to the pilot or not. A successfully downloaded job dictionary is stored in a file for later use.

The Monitoring Loop (Monitor module)

The PanDA pilot is equipped with glExec (currently in development). Because of the multi-job loop, any user switching cannot be performed by the pilot wrapper before the pilot is launched which would be the normal approach. To solve this problem, the inner part of the multi-job loop (henceforth referred to as the monitoring loop) is placed in a separate module called Monitor. glExec user identity switching is done before the Monitor is executed. The Monitor module is used either directly by the pilot, in the case glExec is not used, or after the glExec user identity switching from inside the glExec interface.

The main method of the Monitor module begins with creating a TCP server run as a separate thread. The TCP server will listen to messages sent from a subprocess (see below), which is responsible for the payload. The pilot performs local checks, including a verification of the pilot user proxy and whether there is enough disk space left for running the job. The remaining time of the VOMS proxy extension must be over 48 hours, and the local disk must have at least 5 GB free space left.

3.png The pilot extracts and verifies the existence of the software directory, if defined in schedconfig.appdir, using the extractAppdir() method in the SiteInformation class. Alternatively, it can get the software directory from an environmental variable (such as $VO_ATLAS_SW_DIR in the case of ATLAS). The schedconfig.appdir can be encoded with additional alternative software directories, using the format: appdir = default_software_directory|processing_type1^additional_path1|processing_type2^additional_path2|.., where processing_type is a special identifier (e.g. nightlies, build, release) that identifies the software directory and is specified in the job description by a field called processingType.

4.png The pilot proceeds with forking the subprocess. The parent process corresponds to the monitoring loop (see Monitor.py, a module used by pilot.py), while the child process can be any of the subprocesses described in the Subprocess module section below. The default subprocess is an instance of the RunJob class, which performs payload setup, stage-in of input files, payload execution and stage-out of output files. Additional subprocess modules are currently in development. The subprocess selection has been made experiment-specific in case further subprocess modules need to be developed for special purposes.

Every ten minutes, the main monitoring loop checks the remaining disk space, the number of running processes, that the entire pilot has not run out of time on the worker node, and that the payload is not looping (stuck). It can also make sure that the sizes of output files are within a predefined limit (optional). A 60 s sleep marks the end of a monitoring loop iteration. When job has ended, its work directory is tarred up and is staged out by the main pilot process. Any lingering orphan subprocesses (e.g. from a failed payload) will be identified and killed. As a last step, the payload work directory is deleted and the Monitor returns to the multi-job loop in the pilot module.

At the end of the multi-job loop, the pilot checks if there is time to run another job. If so, the multi-job loop continues, otherwise the pilot ends.

Subprocess modules

The pilot/Monitor forks and monitors a subprocess that is responsible for the payload or for submitting jobs into an HPC. This subprocess can in principle be any module that needs the full attention and supervision of the pilot/Monitor. It is recommended, although not necessary, to follow the plug-in structure with the RunJob classes. The primary PanDA Pilot subprocess module is called RunJob. Several additional subprocess modules are currently in development; RunJobEvent will be used to read and process events from an Event Server, and RunJobHPC will be used for handling HPC jobs. For details regarding developing a new RunJob class, please see the corresponding RunJob document.

The getSubprocessName() and getSubprocessArguments() methods declared in the Experiment class are used to determine which subprocess module to use and how it should be setup. The former method determines the name of the subprocess (e.g. "RunJob" for a normal job) while the latter method decides which arguments should be used to launch the corresponding python module and returns them as a list.

Subprocess_modules2.png

Figure 2. Pilot/Monitor workflow indicating the various subprocess modules.

The following subsections describe the main steps of the RunJob module.

Preliminaries

5.png When the payload is executed by the RunJob module, the corresponding stdout and stderr are redirected to files <name>_stdout.txt and <name>_stderr.txt where name can be defined by the getPayloadName() method declared in the *Experiment class.

Payload setup

6.png Before the payload can be executed, the pilot must know how exactly it should be setup. This is handled by the getJobExecutionCommand() method located in the Experiment class. This method essentially prepares a string with the proper setup scripts, including the source commands, followed by the name of payload script and its arguments, or the corresponding wrapper script that in turn runs the actual payload (in ATLAS this wrapper script is referred to as the transform, or the TRF). Any setup scripts should be verified that they exist and work. The entire setup string produced by the getJobExecutionCommand() method will be executed in its own shell using commands.getstatusoutput.

Stage-in of input files

The pilot receives information (dataset names, local file names and GUIDs) about which input files should be staged in, if any, from the PanDA Server/Job Dispatcher when the job is downloaded. ROOT files in user analysis jobs, can be opened remotely (a.k.a. remote I/O or direct access mode) in which case stage-in is not performed. The pilot still needs to give the full path to the input files in order for the payload wrapper script (runGen or runAthena in the case of ATLAS) to be able to read the file remotely, which means that the pilot has to translate SURL based file paths to TURLs. This is done using information from the schedconfig DB or by using the lcg-getturls command (see the main PanDA Pilot wiki page for more details).

7.png For normal stage-in, the pilot needs to know 1) which copy tool to use, and 2) how to lookup the file information in a file catalog. The copy tool to be used is defined by the schedconfig copytool or copytoolin fields. Two fields are needed since stage-in and stage-out can use different copy tools. In case the same copy tool is to be used for both stage-in and stage-out, only copytool needs to be set, and copytoolin can be left blank. Common copy tools include: lcg-cp, xrdcp, and lsm (local site mover, see documentation). Otherwise, both fields need to be set. For a given copytool value, a Site Mover factory is used to return a corresponding Site Mover object. E.g. for copytool = "xrdcp", the Site Mover factory returns an instance of the xrdcpSiteMover class. Each copytool supported by the pilot has a corresponding class, inheriting from the parent SiteMover class containing common code.

The Pilot has support for both LFC and Rucio file catalogs. For LFC based file lookup, willDoFileLookups() (Experiment class) should return True. For Rucio, the same method should return False.

Stage-out of output files

The pilot needs to know how and where exactly it should transfer the output files to. The copy tool for stage-out is defined by the schedconfig copytool field (note: copytoolin is never used for stage-out) and the corresponding site mover is selected (as described in the previous section) accordingly by the pilot.

8.png The site mover uses the SiteInformation method getProperPaths() to generate the stage-out path based on information from several schedconfig fields (se, seopt, se[prod]path, setokens). For full information regarding these and other schedconfig fields, see this wiki page. For Rucio style file path generation, it is enough to add /rucio at the end of se[prod]path. When the pilot encounters this substring, it will automatically use Rucio style paths. In that case, the file path will depend on dataset name, local file name and scope. For more information, see the Rucio documentation.

The pilot is also equipped with a mechanism for stage-out to an alternative Storage Element (SE). The idea is that if the pilot fails (partially or completely) to stage-out at the primary SE (Tier-2 in ATLAS), it will download new queuedata from the schedconfig DB for an alternative/secondary SE in the same cloud (Tier-1 in ATLAS) and re-attempt stage-out there. To allow this mechanism, or not, implement the SiteInformation allowAlternativeStageOut() method, and return the proper boolean value.

Experiment Class

The Experiment class and its subclasses contain all functionality that is experiment-specific, and not related to any site information (see the SiteInformation section below).

9.png The different Experiment subclasses are distinguished by their data member __experiment (string). In the case of ATLAS, this is set to "ATLAS". The are two types of class methods, compulsory and optional. Compulsory means that the method must be present in the class since the pilot will expect it to exist and will call it. The following methods should be implemented in the Experiment subclass:

  • getJobExecutionCommand() Define and test the command(s) that will be used to execute the payload (see the Subprocess Modules section above)
  • willDoFileLookups() Should (LFC) file lookups be done by the pilot or not?
  • doFileLookups() Update the file lookups boolean (affects willDoFileLookups()). Will most likely become deprecated
  • removeRedundantFiles() Removes redundant files and directories prior to job log creation (everything else will be left in the work directory and tarred up)
  • specialChecks() Execution of special checks (e.g. test imports of LFC python module, CVMFS consistency checks, etc) at the beginning of the pilot - method can be empty
  • checkSpecialEnvVars() Execution of check for special environment variables (e.g. VO_ATLAS_SW_DIR) at the beginning of the pilot - method can be empty
  • updateJobSetupScript() This optional method is used by the pilot to add essential commands (especially the payload execution command) to a script (job_setup.sh) that can be used to recreate the job off the grid. The script ends up in the job log tarball. ATLAS has a minor tweak to this method (adding of some extra environmental variables when the script file is created)
  • verifyProxy() can be used to certify that the grid proxy is valid. The default skeleton implementation only returns exit code 0 (as well as an empty error diagnostics string)
  • get/setCache() can be used to retrieve the cache URL which is optionally set with pilot option -H. Used e.g. by LSST
  • useTracingService is used for tracing file transfers with the DQ2 Tracing Service. Currently only used in ATLAS. Default is False - currently in development, site movers will be refactored for this purpose. Method is used from Mover.py

SiteInformation Class

The SiteInformation class and its subclasses contain all experiment-specific functionality that is related to site information. Identical to the Experiment classes, the different SiteInformation subclasses are distinguished using the data member __experiment (string). Methods that are compulsory to implement include (note that the parent SiteInformation class contains default implementations):

10.png
  • readpar() Read parameter value (schedconfig field) from queuedata file
  • getQueuedataFileName() Define the queuedata filename (default: queuedata.json)
  • verifyQueuedata() Verify the consistency of the queuedata
  • getQueuedata() Download queuedata from somewhere (PanDA Server by default)
  • postProcessQueuedata() Update queuedata fields if necessary (for testing purposes; method can be left empty but must exist)
  • extractAppdir() Extract and confirm appdir from possibly encoded schedconfig.appdir
  • verifySoftwareDirectory() Should the software directory (schedconfig.appdir) be verified? (Boolean)
  • allowAlternativeStageOut() Is alternative stage-out allowed? (Boolean). If the standard/normal stage-out fails, the pilot has the possibility of staging out the files to an alternative site. In ATLAS this works from a T-2 to the T-1 that the job belongs to.
  • forceAlternativeStageOut() Force the stage-out to always use the alternative SE. (Boolean). See previous method.
  • getProperPaths() Return proper paths for the storage element used during stage-out

Error handling

Any error information from a failed wrapper script, or payload, should be reported in a special JSON file, jobReport.json. Specifically, this file must report at least the following:

  • exitCode : Integer error/exit code, e.g. "65"
  • exitAcronym : Abbreviated error/exit acronym, e.g. "TRF_EXEC_FAIL"
  • exitMsg : Descriptive error/exit message, e.g. "Non-zero return code from FTKRecoRDOtoESD (139)"

The pilot will read this file after a payload failure and simply forward the information to the PanDA server. It will later be used by the PanDA Monitor in a special error summary page. As an example, see the Job Error Summary page for ATLAS jobs.

Pilot development and releases

It is recommended to create a new directory in the SVN "branches" directory, and do the pilot development there. The latest pilot version can be found in the SVN "tags" directory and the latest stable/semi-stable development version of the pilot can be found in SVN "branches/PreRelease" (to be created..). When a version is ready for release, it should be announced in the pilot development mailing list (to be created..) after which it will be merged with the current main development version of the pilot by the main pilot developer. The pilot development mailing list can also be used as a discussion forum.


Major updates:
-- PaulNilsson - 05-Jun-2013



Responsible: PaulNilsson

Never reviewed

Topic attachments
I Attachment History Action Size Date Who Comment
PNGpng 1.png r1 manage 3.7 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 10.png r1 manage 4.1 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 2.png r1 manage 3.9 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 3.png r1 manage 4.0 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 4.png r1 manage 3.8 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 5.png r1 manage 3.8 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 6.png r1 manage 4.0 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 7.png r1 manage 3.8 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 8.png r1 manage 4.2 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng 9.png r1 manage 4.1 K 2014-06-16 - 18:09 PaulNilsson  
JPEGjpg Pilot_workflow_glExec.jpg r1 manage 87.4 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng Subprocess_modules.png r1 manage 67.3 K 2014-06-16 - 18:09 PaulNilsson  
PNGpng Subprocess_modules2.png r1 manage 84.6 K 2014-07-08 - 17:42 PaulNilsson  
Topic revision: r10 - 2014-09-15 - PaulNilsson
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    PanDA All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2018 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback