Site Movers

Introduction

This document describes the site movers used by the PanDA Pilot (Pilot 1.0 and 2.0).

General concept and workflow description:

1. The site movers use the DDMEndpoint as the main identifier that determines where output/log files are stored and from which DDMEndpoints input files are read (in practice, from all local ddmendpoints that belong to the same ATLAS site, as resolved from the ddms suggested in Job.ddmendpointIn).

2. A PanDA Queue (PQ) specifies in its schedconfig fields which copytools (e.g. xrdcp, rucio, lsm) are supported/enabled. A prioritized list of copytools can be specified separately for stage-in (activity='pr'), stage-out (activity='pw') and stage-out-logs (activity='pl') via the PQ.acopytools field. If no specific copytool has been assigned to a given activity (e.g. 'pr'), all supported copytools defined in the PQ.copytools field are used by default.

3. Each site mover implementation should provide the list of accepted protocol schemes that can be used by the site mover instance (e.g. the xrdcp site mover expects to operate with the root:// protocol).

4. The top-level site mover manager (the JobMover class) takes each file to be transferred and fetches all protocol settings from the DDM JSON export for the given PanDA activity (e.g. activity='pw'; if none are set, it falls back to the normal DDM activity, e.g. activity='r'). It then iterates over the supported copytools and considers only the accepted protocols (those matching copytool.schemes) for the actual transfers.

5. If a copytool cannot transfer a file for whatever reason (e.g. a code exception, broken site mover logic, or a resource/file that is not accessible through the given copytool/protocol), the next available copytool is used to transfer the file.

6. PQ-specific protocols/copytools can be declared at the level of the PQ (PQ.aprotocols field); they are used first and override the default settings from the DDM protocols.

7. For stage-in, the site mover considers the input replicas resolved by Rucio (using list_replicas()). It is still possible to override the base protocol path using the PQ.aprotocols settings. This flexibility is useful when Rucio contains wrong protocol definitions (for example, some xroot doors are currently defined incorrectly in AGIS/Rucio), more importantly when the Pilot should consider protocols that are not defined in Rucio at all (for example local dcap protocols), or for testing.

The real filename (lfn) is resolved from one known replica by matching a SURL replica (srm, or whatever default protocol is specified in DDMEndpoint.aprotocols under activity 'SE', falling back to 'a' and then to 'r') against all replicas returned by Rucio.
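The copytool selection and fallback behaviour in the workflow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Pilot implementation: the dictionary layout of the queue settings, the helper names resolve_copytools and transfer_with_fallback, the callable-based copytool interface, the scheme lists and the example URLs are all hypothetical. Only the field names acopytools and copytools and the activity codes are taken from the description.

```python
def resolve_copytools(pq, activity):
    """Return the prioritized copytool names for an activity.

    Falls back to all copytools enabled for the queue (pq['copytools'])
    when nothing is assigned to the activity in pq['acopytools'].
    The dict layout is an assumption made for this sketch.
    """
    specific = pq.get("acopytools", {}).get(activity)
    if specific:
        return list(specific)
    return list(pq.get("copytools", []))


def transfer_with_fallback(urls, copytool_names, movers):
    """Try each copytool in priority order until one transfer succeeds.

    urls: candidate URLs for one file (hypothetical input layout)
    movers: maps a copytool name to (accepted schemes, transfer callable)
    Returns (copytool name, url) of the first successful transfer.
    """
    errors = []
    for name in copytool_names:
        schemes, transfer = movers[name]
        # consider only URLs whose scheme the copytool accepts
        accepted = [u for u in urls if u.split("://", 1)[0] in schemes]
        for url in accepted:
            try:
                transfer(url)
                return name, url
            except Exception as exc:  # any failure moves on to the next option
                errors.append("%s: %s" % (name, exc))
    raise RuntimeError("all copytools failed: %s" % "; ".join(errors))


# Stand-in transfer functions; real copytools would run external commands.
def failing_xrdcp(url):
    raise IOError("door not reachable")


def working_lsm(url):
    return None


pq = {"copytools": ["xrdcp", "lsm"], "acopytools": {"pw": ["lsm"]}}
movers = {
    "xrdcp": (["root"], failing_xrdcp),
    "lsm": (["srm", "root"], working_lsm),
}
urls = ["root://host.example//path/file", "srm://host.example/path/file"]

# No stage-in specific list is set, so all enabled copytools are tried in order.
used = transfer_with_fallback(urls, resolve_copytools(pq, "pr"), movers)
```

In this example xrdcp fails on the only URL it accepts, so the transfer is retried with lsm, which mirrors the "next available copytool" rule in item 5.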

Direct access mode

Direct access mode skips copying the input (ROOT) files to the local node during stage-in. It assumes that the payload execution script/command reads each input file directly from a well-formed TURL endpoint constructed by the sitemovers. For analysis jobs with direct access mode activated, the sitemover module automatically appends the --directIn option to the payload execution command (in addition to --usePFCTurl, which is always added by the new sitemovers architecture).
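As an illustration of this option handling, the sketch below appends the two options to the payload command of an analysis job. The function name add_access_options, the plain-string command and the example command are hypothetical; only the option names --usePFCTurl and --directIn come from the text.

```python
def add_access_options(cmd, use_direct_access):
    """Return the analysis payload command with sitemover options appended.

    Each option is added at most once, so calling this twice is harmless.
    """
    options = ["--usePFCTurl"]  # always added by the new sitemovers
    if use_direct_access:
        options.append("--directIn")
    present = cmd.split()
    for opt in options:
        if opt not in present:
            cmd += " " + opt
    return cmd


copy_cmd = add_access_options("payload.sh --arg", False)
direct_cmd = add_access_options("payload.sh --arg", True)
```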

The boolean field schedconfig.direct_access_lan enables direct access mode at the level of the PQ. A job definition can override the direct access mode using the job.accessmode attribute.

The full logic is as follows:

    # fspec is the FileSpec instance of the input file to be checked/transferred

    # queue-level default: checks and relies on the schedconfig.direct_access_lan value
    is_directaccess = JobMover.is_directaccess()
    if job.accessmode == 'copy':
        is_directaccess = False
    elif job.accessmode == 'direct':
        is_directaccess = True
    if fspec.is_directaccess() and is_directaccess:  # direct access mode, no transfer required
        fspec.status = 'direct_access'
        self.log("Direct access mode will be used for lfn=%s .. skip transfer the file" % fspec.lfn)
Snapshot of this logic in the sitemover sources
FileSpec.is_directaccess() current implementation

Functional workflow

The top-level JobMover class controls the execution of the sitemovers functionality. The PanDA RunJob module (or a similar module that consumes sitemover functionality) is expected to initialize a JobMover instance and call one of its three main functions:

  • JobMover.stagein() to apply stage-in transfers
  • JobMover.stageout_outfiles() to apply stage-out of payload execution result files
  • JobMover.stageout_logfiles() to apply stage-out transfers of log files to normal SE

In the current implementation, to integrate the new sitemover logic easily into Pilot 1.0, the following wrapper functions are used to initialize and execute stage-in/stage-out transfers (in particular by the RunJob and JobLog modules):

  • Mover.get_data_new()
  • Mover.put_data_new()

They decorate the old functions (like Mover.get_data()). If a PanDA queue is configured to use the new site mover logic via the boolean field schedconfig.use_newmover, the Pilot automatically switches to the new sitemover workflow (no need to change the Pilot sources).
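The switching idea can be sketched with a plain Python decorator. This is an assumption-laden illustration and not the Pilot code: the decorator name use_new_mover, the queue_settings dictionary standing in for the schedconfig lookup, and the function bodies are all hypothetical; only the names get_data, get_data_new and use_newmover come from the text.

```python
import functools


def use_new_mover(new_func, is_enabled):
    """Decorate an old mover function so calls go to new_func when enabled.

    is_enabled: zero-argument callable returning the use_newmover flag,
    evaluated on every call so the queue setting is read at call time.
    """
    def decorator(old_func):
        @functools.wraps(old_func)
        def wrapper(*args, **kwargs):
            if is_enabled():
                return new_func(*args, **kwargs)
            return old_func(*args, **kwargs)
        return wrapper
    return decorator


queue_settings = {"use_newmover": True}  # stand-in for the schedconfig field


def get_data_new(*args, **kwargs):
    return "new sitemover workflow"


@use_new_mover(get_data_new, lambda: queue_settings["use_newmover"])
def get_data(*args, **kwargs):
    return "old mover workflow"


result_enabled = get_data()
queue_settings["use_newmover"] = False
result_disabled = get_data()
```

Callers keep using get_data() unchanged; only the queue setting decides which workflow runs.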

An example of site mover initialization for stage-in mode (git source)

    from movers import JobMover
    from movers.trace_report import TraceReport

    # resolve the site information for the queue the job runs on
    si = getSiteInformation(job.experiment)
    si.setQueueName(jobSite.computingElement)

    mover = JobMover(job, si, workDir=workDir, stageinretry=stageinTries)

    # analysis jobs are traced with a dedicated event type
    eventType = "get_sm"
    if job.isAnalysisJob():
        eventType += "_a"

    mover.trace_report = TraceReport(localSite=jobSite.sitename, remoteSite=jobSite.sitename, dataset="", eventType=eventType)
    mover.trace_report.init(job)

    try:
        output = mover.stagein()
    except PilotException as e:
        return e.code, str(e), None, {}
    except Exception as e:
        tolog("ERROR: Mover get data failed [stagein]: exception caught: %s" % e)

Data structures

The site movers module consumes the following configuration JSONs from the AGIS/schedconfig/CVMFS caches. The current implementation allows a primary source to be specified together with an additional prioritized list of fallbacks in case of communication problems.

  • DDM Configuration (agis_ddmendpoints.json):
    • description: used to fetch DDMEndpoint specifics as well as the list of protocols supported by the required ddmendpoint (ddm.aprotocols structure). Sitemovers fetch only the information about the ddmendpoints affected by stage-in/stage-out/stage-out-logs
    • source: AGIS source, CVMFS (/cvmfs/atlas.cern.ch/repo/sw/local/etc/agis_ddmendpoints.json)
    • current priority list of sources: CVMFS, AGIS, PANDA (PANDA source is not configured yet)

  • PanDA queue schedconfig data (agis_schedconf.json):
    • description: used to fetch PQ-specific protocols (PQ.aprotocols) and sitemover/copytool settings (PQ.acopytools, PQ.copytools). Sitemovers fetch only the information about the required PanDA queue objects
    • source: AGIS source, CVMFS (/cvmfs/atlas.cern.ch/repo/sw/local/etc/agis_schedconf.json)
    • current priority list of sources: CVMFS, AGIS, PANDA (PANDA source is not configured yet)
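The prioritized fallback between configuration sources can be sketched as below. This is only a minimal illustration: the helper names load_first_available and file_loader, the loader-callable interface, the placeholder file path and the example endpoint data are assumptions, not the actual sitemover code.

```python
import json


def load_first_available(sources):
    """Return (name, data) from the first source that loads successfully.

    sources: list of (name, loader) pairs in priority order; each loader is
    a zero-argument callable returning parsed JSON or raising on failure.
    """
    errors = []
    for name, loader in sources:
        try:
            return name, loader()
        except Exception as exc:  # unreachable source: try the next one
            errors.append("%s: %s" % (name, exc))
    raise RuntimeError("no configuration source available: " + "; ".join(errors))


def file_loader(path):
    """Build a loader that reads a JSON file, such as a CVMFS cache."""
    def load():
        with open(path) as f:
            return json.load(f)
    return load


# The first source points to a missing file, so the second one is used.
sources = [
    ("CVMFS", file_loader("/nonexistent/agis_ddmendpoints.json")),
    ("AGIS", lambda: {"EXAMPLE_DATADISK": {"aprotocols": {}}}),  # stand-in for a remote fetch
]
source_name, ddm_conf = load_first_available(sources)
```

Keeping each source behind a zero-argument loader makes the priority order a plain list, so adding a further source such as PANDA would only mean appending another pair.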


Major updates:
-- PaulNilsson - 2016-05-20

-- AlexeyAnisyonkov - 2016-07-14



Responsible: PaulNilsson

  • sitemover simplified class diagram:
    sitemovers.png
Topic revision: r5 - 2016-07-14 - AlexeyAnisyonkov
 