Monitor
Introduction
The Monitor module monitors the payload, file transfers, memory usage, disk space, and related resources. It also launches the payload execution module (RunJob*).
Monitoring tasks:
a) Remaining local space.
b) Size of payload stdout.
c) Memory usage (currently not enforced).
d) Collection of unmonitored jobs.
e) Processes.
f) Size of output files (in effect not enforced due to very large allowed sizes).
g) Looping jobs (stuck processes and file transfers).
Main workflow
Main functions used by monitor_job(), the first method called by the pilot:
1. [Soon to be deprecated] PilotTCPServer() class
TCP server used to send TCP messages from runJob to pilot.
2. node.collectWNInfo()
Collect the worker node (WN) info again to avoid getting wrong disk info from the gram dir, which might differ from the payload workdir.
3. __getsetWNMem()
Overwrite the memory value, since it should come from either a pilot argument or queuedata.
4. Experiment.verifyProxy()
Verify that the proxy is valid.
5. __checkLocalDiskSpace()
Do we [still] have enough local disk space to run the job? updatePilotErrorReport() is used in case of failure.
6. [Soon to be deprecated] pUtil.isPilotTCPServerAlive()
Make sure the pilot TCP server is still running.
7. [Soon to be deprecated] pUtil.stageInPyModules()
Copy some supporting modules to the workdir so that the pilot job can run.
8. __createJobWorkdir()
Create the job workdir (Panda_Pilot_*/PandaJob_
_<time.time()>) using job.mkJobWorkdir().
9. thisExperiment.updateJobSetupScript()
If desired, create the job setup script (used to recreate the job locally if needed). Note: this step only creates the file with the script header (bash info).
10. createFileStates()
[To be deprecated?] Create the initial file state dictionary for input (if there are input files) and output (two files).
11. pUtil.verifyLFNLength()
Are the output file names within the allowed limit? updatePilotErrorReport() and pUtil.postJobTask() are used in case of failure.
12. __throttleJob()
Throttle the job before starting it.
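Two of the pre-launch checks above (local disk space and LFN length) can be illustrated with a minimal Python 3 sketch. It is not the pilot's implementation: the function names, the 5 GB free-space threshold and the 255-character name limit are hypothetical values chosen for illustration, since the actual limits are not stated here.

```python
import shutil

# Hypothetical limits, for illustration only (not taken from the pilot).
MIN_FREE_BYTES = 5 * 1024**3
MAX_LFN_LENGTH = 255


def has_enough_local_space(path, min_free_bytes=MIN_FREE_BYTES):
    """Return True if the file system holding `path` has enough free space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes


def too_long_lfns(lfns, max_length=MAX_LFN_LENGTH):
    """Return the output file names that exceed the allowed length."""
    return [lfn for lfn in lfns if len(lfn) > max_length]


if __name__ == "__main__":
    if not has_enough_local_space("."):
        print("too little local disk space, the job should not be started")
    bad = too_long_lfns(["log.tgz", "x" * 300])
    print("number of file names that are too long:", len(bad))
```

A failed check would correspond to the error-reporting path described above (updatePilotErrorReport() and, for the LFN check, pUtil.postJobTask()); that part is not reproduced in the sketch.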
Main functions after process forking:
13. [Parent process] pUtil.updatePandaServer()
Send a heartbeat to the server.
14. [Child process] Experiment.getSubprocessName()
Decide which subprocess to launch.
15. [Child process] Experiment.getSubprocessArguments()
Get the arguments needed to launch the subprocess.
16. [Child process] pUtil.stageInPyModules()
Copy all Python files from the site workdir to the payload workdir.
17. [Child process] Experiment.updateJobDefinition()
Update the job definition file (and env object) before using it in RunJob (if necessary).
18. [Child process] __backupJobDef()
Backup the job definition (to the payload workdir, in order to save it in the log file).
19. [Child process] os.execvpe()
Launch the proper RunJob*. os.execvpe() replaces the child process with RunJob*, so it does not return unless the launch itself fails.
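The fork-and-exec pattern behind steps 13 to 19 can be sketched as follows (Python 3, POSIX only). The helper names launch_payload() and wait_for_payload() are hypothetical and the payload command is a placeholder; only os.fork(), os.execvpe() and os.waitpid() are standard library calls.

```python
import os
import sys


def launch_payload(argv, env):
    """Fork; the child execs `argv`, the parent gets the child's pid back."""
    pid = os.fork()
    if pid == 0:
        # Child process: replace this process image with the payload command.
        try:
            os.execvpe(argv[0], argv, env)
        except OSError:
            # Only reached if the exec itself failed (e.g. command not found).
            os._exit(127)
    # Parent process: continue with heartbeats and monitoring.
    return pid


def wait_for_payload(pid):
    """Block until the child has finished and return its exit code."""
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # negative value: killed by a signal


if __name__ == "__main__":
    # Placeholder payload: a Python one-liner standing in for RunJob*.
    child = launch_payload([sys.executable, "-c", "print('payload running')"],
                           dict(os.environ))
    print("exit code:", wait_for_payload(child))
```

In the sketch the parent simply waits for the child. In the Monitor module the parent instead enters the monitoring loop described below and polls for finished children.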
Main monitoring loop [Parent process]:
20. __check_memory_usage()
Every minute, check the memory usage of the payload if required (e.g. not for HPCs). Note: this requires that the memory monitoring tool is executed (as specified in Experiment::shouldExecuteUtility()).
21. __check_remaining_space()
Every ten minutes, check the remaining disk space, the size of the job workdir, and the size of the payload stdout file.
a) __checkPayloadStdout(): Check the size of the payload stdout.
b) collectWNInfo(): Update the worker node info to get the remaining disk space.
c) __checkLocalSpace(): Check the remaining local disk space during running.
d) __checkWorkDir(): Check the size of the payload workdir.
22. create_softlink()
Wrapper around __createSoftLink(), which creates a soft link to the athena stdout in the site work dir.
23. check_unmonitored_jobs()
Make sure all jobs are being monitored. If not, then add the job to the jobDic dictionary.
24. __monitor_processes()
Monitor the number of running processes and the pilot running time.
a) processes.checkProcesses(): Check the number of running processes (using processes.findProcessesInGroup()).
b) __failMaxTimeJob(): If less than 10 minutes remain before the batch time limit, abort the job (using processes.killProcesses()).
25. __verify_output_sizes()
Verify output file sizes every ten minutes, using __checkOutputFileSizes() and __verifyOutputFileSizes().
Note: the current limit is set very high (500 GB) since no limit has been agreed on.
26. [Currently not used] __verify_memory_limits()
Using getMaxMemoryUsageFromCGroups().
27. __check_looping_jobs()
Look for looping jobs every 30 minutes.
a) __allowLoopingJobKiller(): Should the looping job killer be run? Using __runJob.allowLoopingJobKiller().
b) __loopingJobKiller(): Look for a looping job and kill it if necessary with __killLoopingJob().
28. __wdog.pollChildren()
Check if any jobs are done by scanning the process list.
29. __cleanUpEndedJobs()
Clean up the ended jobs (if there are any). Note: includes a thread pool. Calls __cleanUpEndedJob(), which in turn calls killOrphan() and killProcesses(). In particular, it calls pUtil.postJobTask(), which creates the log file and sends the final server update.
30. __wdog.collectZombieJob()
Collect zombie child processes.
(followed by cleanup, some measurements and return to caller module (pilot))
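As an illustration of the looping job check (step 27), one possible heuristic is to treat a job as looping when no file in its workdir has been modified for longer than a given limit. This heuristic, the function names and the 30-minute limit in the usage example are assumptions made for the sketch (Python 3); how __loopingJobKiller() actually decides is not described here.

```python
import os
import time


def latest_modification_time(workdir):
    """Return the newest file mtime found below `workdir` (0.0 if no files)."""
    latest = 0.0
    for root, _dirs, files in os.walk(workdir):
        for name in files:
            try:
                latest = max(latest, os.path.getmtime(os.path.join(root, name)))
            except OSError:
                # The file disappeared between listing and stat; skip it.
                continue
    return latest


def looks_like_looping(workdir, limit_seconds, now=None):
    """Return True if no file in `workdir` was touched within the limit.

    An empty workdir is not reported as looping, since there is nothing
    to base the decision on.
    """
    now = time.time() if now is None else now
    latest = latest_modification_time(workdir)
    return latest > 0 and (now - latest) > limit_seconds


if __name__ == "__main__":
    # Hypothetical limit of 30 minutes, applied to the current directory.
    print("looping:", looks_like_looping(".", 30 * 60))
```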
The main monitoring loop (above) continues until the job has finished, sleeping 60 s between iterations.
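The structure of such a loop, with checks that run at different intervals (every minute, every ten minutes, every 30 minutes), can be sketched as below. This is a simplified Python 3 illustration, not the Monitor code: the PeriodicCheck class, the monitoring_loop() function and the choice to run every check on the first iteration are assumptions.

```python
import time


class PeriodicCheck:
    """A named action that should run at most once per `interval` seconds."""

    def __init__(self, name, interval, action):
        self.name = name
        self.interval = interval
        self.action = action
        self.last_run = None  # None means the check has never run

    def run_if_due(self, now):
        """Run the action if the interval has elapsed; return True if it ran."""
        if self.last_run is None or now - self.last_run >= self.interval:
            self.action()
            self.last_run = now
            return True
        return False


def monitoring_loop(checks, job_finished, sleep_seconds=60,
                    clock=time.time, sleep=time.sleep):
    """Run all due checks, then sleep, until job_finished() returns True."""
    while not job_finished():
        now = clock()
        for check in checks:
            check.run_if_due(now)
        sleep(sleep_seconds)


if __name__ == "__main__":
    # Short demo with a fake clock, so that nothing actually sleeps.
    fake_time = [0]
    checks = [
        PeriodicCheck("memory", 60, lambda: print("check memory usage")),
        PeriodicCheck("space", 600, lambda: print("check remaining space")),
        PeriodicCheck("looping", 1800, lambda: print("check looping jobs")),
    ]
    monitoring_loop(
        checks,
        job_finished=lambda: fake_time[0] >= 180,
        clock=lambda: fake_time[0],
        sleep=lambda seconds: fake_time.__setitem__(0, fake_time[0] + seconds),
    )
```

Passing the clock and sleep functions as parameters keeps the loop testable without waiting for real time to pass.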
Notes
1. The Monitor object is instantiated by the main pilot.
2. Various pilot timings are set in this module.
3. loggingMode (from pilot option -y), used by pUtil.updatePandaServer() (used by the Monitor module) to send a space report, is deprecated.
4. The main function (monitor_job()) is protected with a try-statement.
-- PaulNilsson - 2016-05-17
Major updates:
Responsible: PaulNilsson
Never reviewed