Site Movers
Introduction
This document will describe the site movers used by the
PanDA Pilot (Pilot 1.0 and 2.0).
General concept and workflow description:
1. The site movers use the DDMEndpoint as the main identifier for where to store output/log files and from which DDMEndpoints to read input files (in practice, from all local ddmendpoints that belong to the same ATLAS site, resolved from the ddms suggested in Job.ddmendpointIn).
2. A PanDA Queue (PQ) specifies in its schedconfig fields which copytools (e.g. xrdcp, rucio, lsm) are supported/enabled. It is possible to specify a prioritized list of copytools for each activity, i.e. stage-in (activity='pr'), stage-out (activity='pw') and stage-out-logs (activity='pl'), via the PQ.acopytools field. If no specific copytools have been assigned to a given activity (e.g. 'pr'), then all supported copytools defined in the PQ.copytools field are used by default.
3. Each site mover implementation should provide the list of accepted protocol schemes that can be used by the site mover instance (e.g. the xrdcp site mover expects to operate with the root:// protocol).
4. The top-level site mover manager (JobMover class) takes each file to be transferred and fetches all protocol settings from the DDM JSON export for the given PanDA activity (e.g. activity='pw'; if that is not set, it falls back to the normal DDM activity, e.g. activity='r'). It then iterates over the supported copytools and considers only the accepted protocols (those matching copytool.schemes) for real transfers.
5. If one copytool is not able to transfer a file (for whatever reason, e.g. a code exception, broken site mover logic, or a resource/file that is not accessible through the given copytool/protocol), then the next available copytool is used to transfer the file.
6. PQ-specific protocols/copytools can be declared at the level of the PQ (PQ.aprotocols field); they are used first and override the default settings from the DDM protocols.
7. For stage-in, the site mover considers the input replicas resolved by Rucio (using list_replicas()). It is still possible to override the base protocol path using the PQ.aprotocols settings.
This flexibility is useful when Rucio contains wrong protocol definitions (for example, some xroot doors are currently defined incorrectly in AGIS/Rucio), when testing, or, more importantly, when the Pilot should consider protocols that are not defined in Rucio at all (for example local dcap protocols).
The real filename (lfn) is resolved from one known replica by matching a SURL replica (srm, or whatever default protocol is specified under the 'SE' activity, falling back to 'a' and then to 'r' of DDMEndpoint.aprotocols) against all replicas returned by Rucio.
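The copytool selection and fallback described in items 2 to 5 can be sketched as follows. This is a minimal illustration and not the Pilot code: the dictionary layout used for the queue settings, the function names resolve_copytools and transfer_with_fallback, and the 'schemes'/'copy' fields of a mover are all assumptions made for the example.

```python
def resolve_copytools(queue, activity):
    """Return the prioritized copytool names for an activity.

    `queue` is a hypothetical dict with 'acopytools' (per-activity lists)
    and 'copytools' (default list), mirroring PQ.acopytools / PQ.copytools.
    """
    specific = queue.get('acopytools', {}).get(activity)
    # Fall back to all supported copytools when nothing is set for the activity
    return list(specific) if specific else list(queue.get('copytools', []))


def scheme_of(url):
    """Extract the protocol scheme, e.g. 'root' from 'root://host//path'."""
    return url.split('://', 1)[0] if '://' in url else ''


def transfer_with_fallback(lfn, protocols, copytool_names, movers):
    """Try each copytool in priority order until one transfer succeeds.

    `protocols` is a list of endpoint URLs; `movers` maps a copytool name to
    a hypothetical dict {'schemes': [...], 'copy': callable(url, lfn)}.
    Returns (copytool_name, url) on success, raises RuntimeError otherwise.
    """
    errors = []
    for name in copytool_names:
        mover = movers[name]
        # Consider only the protocols whose scheme this copytool accepts
        accepted = [p for p in protocols if scheme_of(p) in mover['schemes']]
        for url in accepted:
            try:
                mover['copy'](url, lfn)
                return name, url
            except Exception as exc:  # any failure moves on to the next option
                errors.append('%s via %s: %s' % (name, url, exc))
    raise RuntimeError('all copytools failed for %s: %s' % (lfn, errors))
```

Catching every exception in the inner loop reflects the statement in item 5 that a failure "for whatever reason" should lead to the next copytool being tried. A real implementation would probably also record the error codes for reporting.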
Direct access mode
Direct access mode skips copying input (ROOT) files to the local node during stage-in. It assumes that the payload execution script/command reads the input file directly from the TURL endpoint formed by the sitemovers.
For analysis jobs, when direct access mode is activated the sitemover module automatically appends the --directIn option to the payload execution command (in addition to --usePFCTurl, which is always added by the new sitemovers architecture).
The schedconfig.direct_access_lan boolean field is used to enable direct access mode at the level of the PQ.
The job definition can override direct access mode using the job.accessmode attribute.
The full logic is as follows:
# fspec is the FileSpec instance of the input file to be checked/transferred
is_directaccess = JobMover.is_directaccess()  # checks and relies on the schedconfig.direct_access_lan value
if job.accessmode == 'copy':
    is_directaccess = False
elif job.accessmode == 'direct':
    is_directaccess = True
if fspec.is_directaccess() and is_directaccess:  # direct access mode, no transfer required
    fspec.status = 'direct_access'
    self.log("Direct access mode will be used for lfn=%s .. skip transfer the file" % fspec.lfn)
Snapshot of this logic in the sitemover sources
Current implementation of FileSpec.is_directaccess()
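The same decision can be written as a small standalone function, which makes the precedence easier to test in isolation. This is only a sketch: the function name resolve_direct_access and its three parameters are hypothetical stand-ins for the schedconfig.direct_access_lan value, job.accessmode and the result of FileSpec.is_directaccess().

```python
def resolve_direct_access(queue_direct_access, job_accessmode, file_supports_direct):
    """Decide whether an input file is read directly instead of copied.

    queue_direct_access  -- bool, stands in for schedconfig.direct_access_lan
    job_accessmode       -- 'copy', 'direct', or anything else (no override)
    file_supports_direct -- bool, stands in for FileSpec.is_directaccess()
    """
    use_direct = bool(queue_direct_access)
    # The job definition takes precedence over the queue setting
    if job_accessmode == 'copy':
        use_direct = False
    elif job_accessmode == 'direct':
        use_direct = True
    # The file itself must also be eligible for direct access
    return use_direct and bool(file_supports_direct)
```

For example, a job with accessmode 'direct' on a queue where direct access is disabled still reads an eligible file directly, while accessmode 'copy' always forces a transfer.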
Functional workflow
The top-level JobMover class controls the execution of the sitemover functionality.
The PanDA/RunJob module (or a similar one that consumes sitemover functionality) is expected to initialize a JobMover instance and execute one of the three main functions:
- JobMover.stagein() to apply stage-in transfers
- JobMover.stageout_outfiles() to apply stage-out of payload execution result files
- JobMover.stageout_logfiles() to apply stage-out transfers of log files to the normal SE
In the current implementation, to easily integrate the new sitemover logic within Pilot 1.0, the following wrapper functions are used to initialize and execute stage-in/stage-out transfers (in particular by the RunJob and JobLog modules):
- Mover.get_data_new()
- Mover.put_data_new()
They decorate the old functions (like Mover.get_data()). If a PandaQueue is configured to use the new site movers logic via the schedconfig.use_newmover boolean field, the Pilot automatically switches to the new sitemover workflow (no need to change the Pilot sources).
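The switching mechanism can be pictured as a decorator that wraps the old function and forwards the call to the new one when the queue opts in. The sketch below is an assumption about the general shape of such a wrapper, not a copy of the Pilot sources: switch_to_new_mover, the queue_settings dict and the toy get_data/get_data_new bodies are hypothetical.

```python
import functools


def switch_to_new_mover(new_func, queue_settings):
    """Build a decorator that routes calls to new_func when the queue opts in.

    queue_settings is a hypothetical dict standing in for schedconfig.
    """
    def decorator(old_func):
        @functools.wraps(old_func)
        def wrapper(*args, **kwargs):
            # Check the flag at call time so that a queue change is picked up
            if queue_settings.get('use_newmover'):
                return new_func(*args, **kwargs)
            return old_func(*args, **kwargs)
        return wrapper
    return decorator


# Toy stand-ins for the old and new stage-in entry points
queue_settings = {'use_newmover': True}


def get_data_new(job):
    return 'new:%s' % job


@switch_to_new_mover(get_data_new, queue_settings)
def get_data(job):
    return 'old:%s' % job
```

Callers keep invoking get_data() as before, which is why no change to the calling modules is needed when a queue enables the new workflow.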
An example of site mover initialization for stage-in mode (git source):
from movers import JobMover
from movers.trace_report import TraceReport

# Excerpt from inside a Pilot function: job, jobSite, workDir and stageinTries come from the caller
si = getSiteInformation(job.experiment)
si.setQueueName(jobSite.computingElement)

mover = JobMover(job, si, workDir=workDir, stageinretry=stageinTries)

# Build the trace event type; analysis jobs get the "_a" suffix
eventType = "get_sm"
if job.isAnalysisJob():
    eventType += "_a"

mover.trace_report = TraceReport(localSite=jobSite.sitename, remoteSite=jobSite.sitename, dataset="", eventType=eventType)
mover.trace_report.init(job)

try:
    output = mover.stagein()
except PilotException as e:
    return e.code, str(e), None, {}
except Exception as e:
    tolog("ERROR: Mover get data failed [stagein]: exception caught: %s" % e)
Data structures
The site movers module consumes the following configuration JSONs from the AGIS/schedconfig/CVMFS caches. The current implementation allows a primary source and an additional prioritized list of fallbacks to be specified in case of communication problems:
- DDM Configuration (agis_ddmendpoints.json):
  - description: used to fetch DDMEndpoint specifics as well as the list of protocols supported by the required ddmendpoint (ddm.aprotocols structure). Sitemovers fetch only the information about the ddmendpoints affected by stage-in/stage-out/stageout-logs
  - source: AGIS source, CVMFS (/cvmfs/atlas.cern.ch/repo/sw/local/etc/agis_ddmendpoints.json)
  - current priority list of sources: CVMFS, AGIS, PANDA (PANDA source is not configured yet)
- PanDA queue schedconfig data (agis_schedconf.json):
  - description: used to fetch PQ-specific protocols (PQ.aprotocols) and sitemover/copytool settings (PQ.acopytools, PQ.copytools). Sitemovers fetch only the information about the required pandaqueue objects
  - source: AGIS source, CVMFS (/cvmfs/atlas.cern.ch/repo/sw/local/etc/agis_schedconf.json)
  - current priority list of sources: CVMFS, AGIS, PANDA (PANDA source is not configured yet)
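The prioritized list of sources can be read as "try each source in order and keep the first one that yields valid JSON". A minimal sketch of that idea is shown below. The helper names load_from_sources and filter_entries, and the representation of each source as a zero-argument callable returning JSON text, are assumptions for illustration and do not describe the Pilot's actual loader.

```python
import json


def load_from_sources(sources, priority):
    """Return (source_name, data) from the first source that loads and parses.

    sources  -- hypothetical dict: source name -> zero-argument callable
                returning the raw JSON text (file read, HTTP fetch, ...)
    priority -- ordered list of source names, e.g. ['CVMFS', 'AGIS', 'PANDA']
    """
    failures = []
    for name in priority:
        loader = sources.get(name)
        if loader is None:
            # Source listed in the priority list but not configured yet
            failures.append('%s: not configured' % name)
            continue
        try:
            return name, json.loads(loader())
        except Exception as exc:  # unreadable file, network error, bad JSON
            failures.append('%s: %s' % (name, exc))
    raise RuntimeError('no usable source: %s' % '; '.join(failures))


def filter_entries(data, wanted):
    """Keep only the entries (e.g. ddmendpoints or queues) that are needed."""
    return dict((key, data[key]) for key in wanted if key in data)
```

With a CVMFS loader that fails and an AGIS loader that succeeds, load_from_sources returns the AGIS data, and filter_entries then narrows it to the ddmendpoints or queues relevant for the job.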
Major updates:
-- PaulNilsson - 2016-05-20
-- AlexeyAnisyonkov - 2016-07-14
Responsible: PaulNilsson
- sitemover simplified class diagram: