RunJob

Introduction

For a standard grid job, the pilot's Monitor module forks a subprocess and monitors it for the duration of the job. The subprocess, RunJob (formerly called 'runJob'), is responsible for determining the payload setup, performing any stage-ins, executing the payload and staging out its output at the end of the job. To facilitate the introduction of other subprocess modules with different workflows, the older runJob module has been refactored into the RunJob class, which has the same plug-in structure as the Experiment and SiteInformation classes. RunJob will eventually become a "full" base class, containing only the methods that are common to all sub classes that inherit from it. At that point, the bulk of the current methods in RunJob will be moved to RunJobNormal, i.e. a class that will be used for running normal PanDA jobs (see the diagram below).

The class responsible for Event Service jobs is called RunJobEvent. The classes containing the HPC code are called RunJobHPC, RunJobTitan and RunJobMira (RunJobIntrepid?). The inheritance structure is shown in the diagram below. Code that is common for different HPCs will be placed in RunJobHPC.

Figure: RunJob class diagram (RunJobClassDiagram.png)

Responsibilities

It was recommended by Kaushik at the Event Service meeting on June 19, 2014, that Danila be responsible for the RunJobHPC (the base class for the HPC code) and RunJobTitan classes, that Taylor and Doug be responsible for the RunJobMira class, and that Paul be responsible for RunJob and the overall structure. Note (2016): Wen is responsible for the RunJob AES class used on HPCs.

Implementation

A factory class (RunJobFactory) is used to return an instance of the proper sub class, provided that the name is known, e.g. 'normal', 'hpc', 'titan', 'eventservice', etc. The name is a string that is defined as a private data member, called __runjob, in the RunJob classes. Unlike the Experiment and SiteInformation classes, RunJob classes have fewer restrictions and demands as to which methods are required.
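The factory lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the method names getRunJob() and newRunJob(), the use of 'normal' as the name of the base class, and the way sub classes are discovered are all hypothetical and are not taken from the pilot code.

```python
class RunJob(object):
    # Private data member holding the name (assumed here to be 'normal')
    __runjob = "normal"

    def getRunJob(self):
        # Hypothetical getter; name mangling makes __runjob private to each
        # class, so every sub class defines its own copy of this method
        return self.__runjob


class RunJobTitan(RunJob):
    __runjob = "titan"

    def getRunJob(self):
        return self.__runjob


class RunJobEvent(RunJob):
    __runjob = "eventservice"

    def getRunJob(self):
        return self.__runjob


def _allSubclasses(cls):
    # Collect direct and indirect sub classes of cls
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(_allSubclasses(sub))
    return found


class RunJobFactory(object):
    def newRunJob(self, name="normal"):
        # Hypothetical method: return an instance of the class whose
        # private name matches the requested one
        for cls in [RunJob] + _allSubclasses(RunJob):
            instance = cls()
            if instance.getRunJob() == name:
                return instance
        raise ValueError("unknown RunJob name: %s" % name)
```

With this sketch, RunJobFactory().newRunJob('titan') returns a RunJobTitan instance, and an unknown name raises a ValueError instead of silently falling back to the base class.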

It is recommended to copy the argumentParser() method from RunJob and update it accordingly in the relevant sub class. This will allow the sub class to be executed like 'python RunJobTitan.py -a option1 -b option2 ..', which in turn means that a main function should be defined in the sub class (again, see RunJob.py, search for "__main__" at the bottom of the file and add the workflow). The pilot will assume that the argument list is defined in getSubprocessArguments() in the Experiment class. In the same class, update getSubprocessName() accordingly, e.g. add an if-statement for name = 'Titan' together with the corresponding logic.
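A minimal sketch of what such a sub class and the corresponding Experiment update might look like is given below. The options -a and -b are the placeholders from the example command line above, not real pilot options. The use of argparse, the attribute names option_a and option_b, and the return values of getSubprocessName() are assumptions made for illustration only.

```python
import argparse


class RunJobTitan(object):
    def argumentParser(self, argv=None):
        # Placeholder options; a real sub class would copy and adapt
        # the option list from RunJob.argumentParser()
        parser = argparse.ArgumentParser(description="RunJobTitan (sketch)")
        parser.add_argument("-a", dest="option_a", default=None)
        parser.add_argument("-b", dest="option_b", default=None)
        # Unknown options are ignored in this sketch
        options, _unknown = parser.parse_known_args(argv)
        return options


class Experiment(object):
    def getSubprocessName(self, name):
        # Assumed behaviour: map the name to the module to be executed
        if name == "Titan":
            return "RunJobTitan"
        return "RunJob"


if __name__ == "__main__":
    runJob = RunJobTitan()
    options = runJob.argumentParser()
    # The sub class workflow (setup, stage-in, payload, stage-out) goes here
    print("RunJobTitan options: %s" % vars(options))
```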

Description (RunJob module)

This module is responsible for:

1) Preparing the setup for the payload.
2) Stage-in of any input.
3) Executing the payload.
4) Stage-out of the payload output.
This applies to "normal" grid jobs on "normal" sites.

Main workflow

The main functions of the RunJob module are as follows:

1. The job definition is imported (newJobDef) as a python module.

2. runJob.argumentParser()

RunJob options are interpreted by this function.

3. Set node info

a) Node.setNodeName() Set the node name data member.
b) Node.collectWNInfo() Collect node information (cpu, memory and disk space).

4. Signal handling (sig2exc())

Definition of signal handler followed by registration of supported signals: SIGTERM, SIGQUIT, SIGSEGV, SIGXCPU, SIGUSR1, SIGBUS
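The pattern of defining a handler and then registering it for the supported signals can be sketched like this. The exception class SignalReceived and the helper registerSignals() are hypothetical, and it is an assumption of this sketch that sig2exc() converts the signal into an exception (as its name suggests); the real handler may do more.

```python
import signal

# Signals listed in this document
SUPPORTED_SIGNALS = ["SIGTERM", "SIGQUIT", "SIGSEGV",
                     "SIGXCPU", "SIGUSR1", "SIGBUS"]


class SignalReceived(Exception):
    # Hypothetical exception type used to unwind the main workflow
    pass


def sig2exc(signum, frame):
    # Convert the received signal into an exception
    raise SignalReceived("caught signal %d" % signum)


def registerSignals(handler=sig2exc):
    # Register the handler for every supported signal that exists on
    # this platform (must be called from the main thread)
    registered = []
    for name in SUPPORTED_SIGNALS:
        signum = getattr(signal, name, None)
        if signum is not None:
            signal.signal(signum, handler)
            registered.append(name)
    return registered
```

Raising an exception from the handler lets the main workflow catch it in one place and perform cleanup before the subprocess exits.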

Setup section.

5. setup()

Prepare the setup and get the run command list. Note: running multiple trf:s needs to be supported.

a) RunJobUtilities.verifyMultiTrf(), in case of multiple trf:s.
b) thisExperiment.getJobExecutionCommand(), called for each trf.

(The pilot server is called before and after the setup() function.)

Stage-in section.

6. stageIn()

Stage-in all input files (if necessary).

7. RunJobUtilities.updateRunCommandList()

Update the run command if necessary. Some trf options might have to be removed if they are not relevant (e.g. remove --directIn if all files were actually staged in).
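As an illustration of the --directIn case, the sketch below removes the option from every run command when all files have been staged in. The function signature, the helper removeOption() and the assumption that --directIn is a stand-alone flag without a value are hypothetical; the simple whitespace tokenisation would also not preserve quoted arguments containing several spaces.

```python
def removeOption(command, option):
    # Drop every stand-alone occurrence of the option from the command
    tokens = command.split()
    return " ".join(token for token in tokens if token != option)


def updateRunCommandList(runCommandList, allFilesStagedIn):
    # Leave the commands untouched unless all input files were staged in
    if not allFilesStagedIn:
        return list(runCommandList)
    return [removeOption(cmd, "--directIn") for cmd in runCommandList]
```

For example, with allFilesStagedIn set to True, the command 'trf.py --directIn --maxEvents 10' (a made-up command line) becomes 'trf.py --maxEvents 10'.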

8. unzipStagedFiles()

Unzip the staged in file if necessary (i.e. if the input file was zipped).

Payload execution section.

9. RunJobUtilities.setEnvVars()

Set special env variables if necessary.

10. executePayload()

Execute the payload.

a) downloadEventRanges() for clone jobs ('runonce')
b) extractEventRanges() create a list of event ranges from the downloaded message
c) getSubprocess() start the subprocess (payload)
d) getUtilitySubprocess() start the utility subprocess if necessary (memory monitor for ATLAS)
e) Copy the utility output if necessary to the pilot init dir
f) pUtil.setTimeConsumed() get the CPU consumption time

Stage-out section.

11. FileHandling.extractOutputFiles()

If possible, get the list of output files from the jobReport created by the trf. This might overwrite the file list in the job description. Note: currently only used for ATLAS production jobs.

a) extractOutputFilesFromJSON(): extract the files from the JSON
b) removeNoOutputFiles(): remove any files that don't actually exist
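The two helper steps can be sketched as below. The JSON layout used here (files, output, subFiles, name) is an assumption made for this example, as are the function signatures; the actual jobReport structure and the pilot's own implementation should be checked before relying on it.

```python
import json
import os


def extractOutputFilesFromJSON(report):
    # Walk the assumed layout: files -> output -> subFiles -> name
    names = []
    for output in report.get("files", {}).get("output", []):
        for subFile in output.get("subFiles", []):
            name = subFile.get("name")
            if name:
                names.append(name)
    return names


def removeNoOutputFiles(workdir, fileNames):
    # Keep only the files that actually exist in the work directory
    return [name for name in fileNames
            if os.path.exists(os.path.join(workdir, name))]


def extractOutputFiles(jobReportPath, workdir):
    # Combine both steps for a jobReport stored as a JSON file
    with open(jobReportPath) as f:
        report = json.load(f)
    return removeNoOutputFiles(workdir, extractOutputFilesFromJSON(report))
```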

12. FileHandling.getDestinationDBlockItems()

Return destinationDBlock items (destinationDBlockToken, destinationDblock, scope) for a given file. Note: only called if extractOutputFiles() above returned any output files

a) filterSpilloverFilename() Remove any _NNN or ._NNN from the file name (in case of spill over file)
b) getOutputFileItem() Which is the corresponding destinationDBlockToken for this file?
c) getOutputFileItem() Which is the corresponding destinationDblock for this file?
d) getOutputFileItem() Which is the corresponding scopeOut for this file?
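The spill-over filtering in step a) can be illustrated with a regular expression. This sketch assumes that NNN is a run of digits and that the suffix sits at the very end of the file name; both are assumptions, and the pilot's own pattern may differ.

```python
import re

# Optional dot, underscore and a run of digits at the end of the name
_SPILLOVER = re.compile(r"\.?_\d+$")


def filterSpilloverFilename(filename):
    # Remove a trailing _NNN or ._NNN, leaving other names untouched
    return _SPILLOVER.sub("", filename)
```

With these assumptions, 'out.root_001' and 'out.root._002' both map to 'out.root', while 'out.root' is returned unchanged. Note that a name which legitimately ends in an underscore followed by digits would also be shortened.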

13. RunJobUtilities.prepareOutFiles()

Verify and prepare the output files for transfer. Make sure the file(s) exist. Find the mod time of the file (needed for ND).
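A minimal sketch of the verification step, assuming a hypothetical signature that takes the list of output file names and the work directory. The error handling and the format of the modification time string are assumptions of this example.

```python
import os
import time


def prepareOutFiles(outFiles, workdir):
    # Verify that every output file exists and record its mod time
    modTimes = {}
    for name in outFiles:
        path = os.path.join(workdir, name)
        if not os.path.isfile(path):
            raise OSError("output file is missing: %s" % path)
        mtime = time.gmtime(os.path.getmtime(path))
        modTimes[name] = time.strftime("%Y-%m-%d %H:%M:%S", mtime)
    return modTimes
```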

14. Create an XML string/file for the output+log files to pass to the server (metadata-.xml)

The following functions are used:

a) getDatasets() Get the datasets for the output files
b) moveTrfMetadata() Rename the existing trf metadata file for non-build jobs (metadata.xml -> metadata-<jobId>.xml.PAYLOAD)
c) createFileMetadata() Create the metadata for the output+log files
c1) RunJobUtilities.getOutFilesGuids() get the outFilesGuids from the PFC
c1a) getGUIDSourceFilename() Return "PoolFileCatalog.xml" (for ATLAS), if empty string - pilot will generate guids
c2) pUtil.getOutputFileInfo() Return lists with file sizes and checksums for the given output files
c2a) SiteMover.getLocalFileInfo() Return exit code (0 if OK), file size and checksum of a local file, as well as a date string if requested
c3) pUtil.PFCxml() Create preliminary metadata (no metadata yet about log file - added later in pilot.py)
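To illustrate step c2a), the sketch below returns an exit code, the file size and a checksum for a local file. The choice of adler32 as the checksum, the non-zero error code 1 and the function signature are assumptions made for this example, and the optional date string is left out for brevity.

```python
import os
import zlib


def getLocalFileInfo(path, blockSize=1024 * 1024):
    # Return (exit code, file size, checksum); exit code 0 means OK
    if not os.path.isfile(path):
        return 1, 0, ""  # hypothetical error code for a missing file
    checksum = 1  # adler32 start value
    with open(path, "rb") as f:
        while True:
            block = f.read(blockSize)
            if not block:
                break
            # Update the running checksum block by block
            checksum = zlib.adler32(block, checksum)
    size = os.path.getsize(path)
    return 0, size, "%08x" % (checksum & 0xffffffff)
```

Reading the file in blocks keeps the memory use constant, which matters for large output files.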

15. RunJobUtilities.convertMetadata4NG()

Convert the preliminary metadata-.xml file to OutputFiles-.xml for Nordugrid.

16. downloadEventRanges()

If clone job (job.cloneJob == "storeonce"), make sure that stage-out should be performed; in case an event range is downloaded, then it's ok to proceed and stage-out the file(s). If no event ranges are downloaded, then the clone job was already executed and stored - no need to proceed.
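The decision described above can be summarised in a few lines. The helper name shouldStageOut() and its arguments are hypothetical; only the 'storeonce' value and the rule itself come from the description.

```python
def shouldStageOut(cloneJob, eventRanges):
    # Non-clone jobs (and other clone modes) always proceed to stage-out
    if cloneJob != "storeonce":
        return True
    # For 'storeonce' clone jobs, a downloaded event range means that
    # no other clone has stored the output yet
    return len(eventRanges) > 0
```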

17. stageOut()

Stage-out output files.

a) pUtil.PFCxml() Generate the xml (OutPutFileCatalog.xml) for the output files and the site mover
b) mover.mover_put_data() Perform the stage-out
Note: mover_put_data() is further described in the Mover wiki document (for Mover.py).


Major updates:
-- PaulNilsson - 16 Jun 2014 -- PaulNilsson - 20 Jun 2016



Responsible: PaulNilsson

Never reviewed

Topic revision: r6 - 2016-06-21 - PaulNilsson
 