PilotMain (2016-06-20, PaulNilsson)
%CERTIFY%

---+!!<nop>%TOPIC%

%TOC%

%STARTINCLUDE%

---+Introduction

The main PanDA pilot module is called pilot.py and is the code that is launched directly by the wrapper. The pilot module downloads a job definition (payload) from the server (or discovers it locally) and prepares for its execution. The pilot forks a separate process running one of the !RunJob* modules, which directly executes the payload. The pilot monitors the !RunJob* process (handled by the [[Monitor][Monitor]] module) and makes sure that the payload is updating its output files.

The main workflow, discussed in functional detail below, contains a sequential multi-job loop. This refers to the ability to run multiple jobs one after the other until the pilot runs out of time, as defined by the schedconfig.timefloor variable. During this time, the pilot is allowed to start new jobs.

---+Description

This module is the entry point for the PanDA Pilot workflow. It runs job recovery, performs special checks, downloads queuedata and jobs, and launches the [[Monitor][Monitor]] instance, which in turn launches the payload execution module (!RunJob*).

---+Main workflow

Main functions prior to the multi-job loop:
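As context for the step lists that follow, the setup phase plus the timefloor-bounded loop described in the introduction can be sketched roughly as follows. This is illustrative Python only, not the actual pilot code: the function names mirror the text, but their signatures and the timefloor handling are simplified assumptions.

<verbatim>
import time

def run_pilot(timefloor_seconds, get_job, run_and_monitor):
    """Sketch of the sequential multi-job loop (simplified assumption)."""
    start = time.time()
    jobs_run = 0
    while True:
        job = get_job()          # download a job definition, or read a pre-placed file
        if job is None:
            break                # no more work available
        run_and_monitor(job)     # fork RunJob* and let Monitor watch it to completion
        jobs_run += 1
        # The pilot may only start new jobs while still inside the timefloor window.
        if time.time() - start > timefloor_seconds:
            break
    return jobs_run
</verbatim>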
1. pUtil.handleQueuedata(): download (!SiteInformation.getQueuedata()) and update queuedata if requested (by !SiteInformation.postProcessQueuedata()).

2. !Experiment.specialChecks() [ATLAS]
<verbatim>
a) displayArchitecture()
b) displayChangeLog()
c) setPilotPythonVersion(): set ATLAS_PYTHON_PILOT
d) testCVMFS(): not called on HPCs.
</verbatim>

3. environment.getBenchmarkDictionary(): run the benchmark test if required by the experiment site information object.

4. createSiteWorkDir(): create the initial pilot workdir (Panda_Pilot_*).

5. !Experiment.checkSpecialEnvVars() [Useless]: check special environment variables.

6. !WatchDog class [Useless if threads are used?]

7. Signal handling: registration of the supported signals SIGTERM, SIGQUIT, SIGSEGV, SIGXCPU, SIGUSR1 and SIGBUS.

8. runJobRecovery() [Reimplemented as !DeferredStageout]: the job recovery mechanism was designed to recover the remains of jobs run by a different pilot on the same WN. The current version only handles remaining output files that were not transferred properly. Note: job recovery is only needed if alternative stage-out is not used.

9. diskCleanup(): perform local disk cleanup using !Cleaner.cleanup():
<verbatim>
a) purgeEmptyDirs(): empty directories in other Panda_Pilot_* directories are removed if not touched for at least 12 hours.
b) purgeWorkDirs(): lingering athena directories (Panda_Pilot_*/PandaJob*) are removed if not touched for at least 12 hours.
c) purgeMaxedoutDirs(): lingering maxed-out directories (i.e. where a MAXEDOUT job state file is present) are removed if not touched for at least 12 hours.
d) PanDA Pilot directory (Panda_Pilot_*) clean-up will be _investigated_ if the directory has not been touched for at least 12 hours; actual cleanup is done only for jobs in running, starting or holding states for over one week.
</verbatim>

Main functions inside the multi-job loop:

1. createSiteWorkDir(): create the site.workdir and write the path to file. chmod to 0770.

2. !Experiment.verifyProxy(): make sure the proxy lifetime is long enough.
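Steps 1 and 2 prepare the per-job environment. The workdir handling in step 1 might look like the following sketch; the directory naming scheme and the path file name are illustrative assumptions, not the actual pilot code.

<verbatim>
import os

def create_site_workdir(parent, pid):
    """Sketch: create a Panda_Pilot_* workdir, chmod it to 0770 and record its path."""
    workdir = os.path.join(parent, "Panda_Pilot_%d" % pid)
    os.makedirs(workdir)
    os.chmod(workdir, 0o770)  # owner and group access only
    # Record the path so that later components (e.g. cleanup) can find the directory.
    with open(os.path.join(parent, "CURRENT_SITEWORKDIR"), "w") as f:
        f.write(workdir)
    return workdir
</verbatim>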
3. node.collectWNInfo(): collect information about the WN.

4. getsetWNMem(): get the memory limit from queuedata or from the -k pilot option and set it. For non-CGROUPS sites, use resource.setrlimit().

5. checkLocalDiskSpace(): do we have enough local disk space left to run the job? (This test is skipped for ND true pilots; the job will instead be failed in [[Monitor][Monitor]].monitor_job() if it runs out of disk space.)

6. getJob(): download a new job from the dispatcher (or from a pre-placed file). Loop over getNewJob():
<verbatim>
a) getDispatcherDictionary(): construct a dictionary for passing to jobDispatcher and get the prodSourceLabel.
b) Read a pre-placed job definition.
c) Call pUtil.httpConnect() to download the job definition from the server.
d) backupDispatcherResponse(): back up the response (will be copied to the workdir later).
</verbatim>

7. !Experiment.postGetJobActions(): perform any special actions after the job definition download.
<verbatim>
a) [ATLAS] verifyNCoresSettings(): verify that the nCores settings are correct.
</verbatim>

8. [[Monitor][Monitor]] instance: launch pilot monitoring.

---+Notes

<verbatim>
1. The pilot should return a standard shell exit code (pUtil.shellExitCode()).
2. Keep track of when the pilot was launched (pilot_startup). Used by Monitor to measure the time since pilot startup.
3. checkLocalSE() is implemented but currently not used. It could be used to verify that the local SE is healthy, but instead of lcg-ls (soon to be deprecated) a better (more modern) middleware tool should be used.
4. The pilot option -y <loggingMode> is deprecated. See its use in Monitor, which forwards it to PandaServerClient.
5. The main function (runMain()) is protected with a try-statement.
6. The pilot module currently houses the large job recovery algorithm. This will be removed in favour of alternative stage-out.
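
# As an aside to step 5 above (checkLocalDiskSpace), the kind of free-space
# check involved can be sketched in Python roughly as follows. This is an
# illustrative sketch only, not the actual pilot code, and the 2 GB default
# threshold is an assumed value:
import os

def enough_local_disk_space(path, required_bytes=2 * 1024**3):
    st = os.statvfs(path)                    # POSIX filesystem statistics
    free_bytes = st.f_bavail * st.f_frsize   # space available to unprivileged users
    return free_bytes >= required_bytes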
</verbatim>

-----

*Major updates*:%BR%
-- Main.PaulNilsson - 25-Sep-2012
-- Main.PaulNilsson - 17-May-2016

%RESPONSIBLE% %REVINFO{"$wikiusername" rev="1.1"}% %BR%
%REVIEW% *Never reviewed*
%STOPINCLUDE%