PanDA Pilot 2

WARNING: this page is deprecated. Please see https://github.com/PanDAWMS/pilot2/wiki instead.

Introduction

This document describes the PanDA Pilot 2, which is currently in the late development stage.

Developers

Current Developers

Paul Nilsson (Project Leader; General development and management)
Danila Oleynik (HPC)
Wen Guan (AES)
Alexey Anisenkov (Information service)
Doug Benjamin (HPC)
Tomas Javurek (DDM)

Former Developers or Code Contributors

Daniil Drizhuk (General development)
Mario Lassnig (DDM)
Tobias Wegner (DDM)
Pavlo Svirin

Information for new developers, including how to get started with GitHub, can be found here.

Requirements

The functional requirements are being discussed on internal Google Docs. Feature requests for Pilot 1 will also be added to the wish list section below.

Documentation

This section describes the main features, workflows and algorithms. The API, modules and functions documentation can be found here.

Pilot version

While Pilot 1 uses a versioning scheme of the form <release>.<revision> (e.g. 73.2), Pilot 2 will use the format <release>.<version>.<revision>.<build> (e.g. 2.0.0.1 or 2.0.0 (1)). In this scheme, a new <release> (which will not change for Pilot 2) introduces major new features and involves significant internal rework or re-architecting. A new <version> adds new features and fixes assorted bugs. It generally does not include major internal design or architecture changes and should be (mostly) backwards-compatible with other versions of the same release. The <revision> usually contains bug fixes and tiny enhancements and will stay at '0' for most releases. The <build> number is a sequential build number within a release. It is normally incremented automatically by a Continuous Integration (CI) build process, but for Pilot 2 this is not implemented and the number is currently set manually for each build. Only one build will be the official build for a given revision. [To be confirmed]

It is the responsibility of the release manager to update the Pilot version string.
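As an illustration of the version format described above, here is a minimal Python sketch that parses both spellings of a Pilot 2 style version string ("2.0.0.1" and "2.0.0 (1)"). The names PilotVersion and parse_pilot_version are hypothetical and are not taken from the Pilot 2 code base.

```python
import re
from typing import NamedTuple


class PilotVersion(NamedTuple):
    """Hypothetical container for <release>.<version>.<revision>.<build>."""
    release: int
    version: int
    revision: int
    build: int

    def dotted(self) -> str:
        # Fully dotted form, e.g. "2.0.0.1"
        return f"{self.release}.{self.version}.{self.revision}.{self.build}"

    def with_build_suffix(self) -> str:
        # Form with the build number in parentheses, e.g. "2.0.0 (1)"
        return f"{self.release}.{self.version}.{self.revision} ({self.build})"


# Accepts either "2.0.0.1" or "2.0.0 (1)".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\.(\d+)| \((\d+)\))$")


def parse_pilot_version(text: str) -> PilotVersion:
    """Parse a Pilot 2 style version string into its four components."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a Pilot 2 style version string: {text!r}")
    release, version, revision, dotted_build, paren_build = match.groups()
    build = dotted_build if dotted_build is not None else paren_build
    return PilotVersion(int(release), int(version), int(revision), int(build))
```

Because the result is a tuple of integers, two versions can be compared directly, e.g. parse_pilot_version("2.0.1.3") > parse_pilot_version("2.0.0.7"). A Pilot 1 style string such as "73.2" is rejected with a ValueError.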

Error codes

Pilot 2 error codes are largely inherited from Pilot 1, but with a few new ones as well. See the Pilot2ErrorCodes page for the current range of implemented error codes.

Pilot testing

The pilot is tested in several layers before a release. This document describes how this is done.

Signal handling

Both Pilot 1 and 2 can trap the signals in the list below. When such a signal is received, the Pilot aborts the job, data transfer or whatever the Pilot is doing at the moment, and informs the server with a corresponding error code.

Signal      Pilot error code
KILLSIGNAL  1200, Job terminated by unknown kill signal
SIGTERM     1201, Job killed by signal: SIGTERM
SIGQUIT     1202, Job killed by signal: SIGQUIT
SIGSEGV     1203, Job killed by signal: SIGSEGV
SIGXCPU     1204, Job killed by signal: SIGXCPU
SIGUSR1     1206, Job killed by signal: SIGUSR1
SIGBUS      1207, Job killed by signal: SIGBUS
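The mapping above could be expressed in Python roughly as follows. This is only a sketch using the standard signal module: the function names and the callback convention are hypothetical, and the real Pilot implementation may be organised differently. The error codes and messages are copied from the table.

```python
import signal

# Fallback used for any signal that is not in the table (KILLSIGNAL row).
UNKNOWN_KILL_SIGNAL = (1200, "Job terminated by unknown kill signal")

# Signal name -> Pilot error code, copied from the table above.
TRAPPED_SIGNALS = {
    "SIGTERM": 1201,
    "SIGQUIT": 1202,
    "SIGSEGV": 1203,
    "SIGXCPU": 1204,
    "SIGUSR1": 1206,
    "SIGBUS": 1207,
}


def error_for_signal(signum):
    """Return (error code, message) for a received signal number."""
    try:
        name = signal.Signals(signum).name
    except ValueError:
        return UNKNOWN_KILL_SIGNAL
    code = TRAPPED_SIGNALS.get(name)
    if code is None:
        return UNKNOWN_KILL_SIGNAL
    return code, f"Job killed by signal: {name}"


def install_handlers(on_kill):
    """Register a handler for every trapped signal available on this platform.

    on_kill is a hypothetical callback taking (error_code, message); in a
    real pilot it would abort the current activity and inform the server.
    Must be called from the main thread, as required by signal.signal().
    """
    def _handler(signum, frame):
        on_kill(*error_for_signal(signum))

    for name in TRAPPED_SIGNALS:
        signum = getattr(signal, name, None)  # some signals are Unix-only
        if signum is not None:
            signal.signal(signum, _handler)
```

A caller would typically run install_handlers(callback) once at startup, where callback records the error code and triggers a graceful abort.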

MiniPilot (deprecated)

A MiniPilot has been developed. It will be used primarily during the implementation and testing stage of the Pilot 2 Project. A description with instructions on how to use it can be found in the corresponding MiniPilot GitHub.

Movers

The copytools or movers (formerly known as 'site movers') are responsible for transferring input and output from/to mass storage. This is the current list of supported movers:

  • Rucio mover (Pilot 1/Pilot 2)
  • Gfal-copy mover (Pilot 1/Pilot 2)
  • Local mover (Pilot 1)
  • xrdcp mover (Pilot 1/Pilot 2)
  • objectstore mover (inheriting from rucio site mover) (Pilot 1)
  • dCache mover (soon to be deprecated?) (Pilot 1)
  • lcg-cp mover (soon to be deprecated?) (Pilot 1)
  • cp/mv/ln mover (Pilot 2 - Pilot 1 supports cp, ln via storm mover)

The Pilot 1 mover architecture is described here. To be documented for Pilot 2.

Event Service

The Pilot 2 ES implements the Event Service processing.

Job Report processing

The processing of the job report (produced by TRFs) in the Pilot is described here. The Pilot always processes the job report from production jobs (TRFs always produce this report). It also processes the report from a user job whenever one is produced, which is the case if the user job uses a production TRF.

Rucio traces sent by the pilot

The pilot sends detailed information about file transfers to Rucio. Here is a list of the different fields contained in the trace report.

Pilot 2 usages of schedconfig

This is the documentation of all schedconfig fields used by Pilot 2.

Information sent to PanDA server

The list of all dictionary fields sent to the PanDA server during job updates can be found here.

Direct access and the Pilots

An explanation of how the different pilot versions handle direct access.

Pilot timing measurements

The pilot sends a timing string to the server during the final job update with the following condensed format:

pilotTiming time_getjob time_stagein time_payload time_stageout time_total_setup

Times are measured in seconds. Each value is the time taken by the corresponding operation:

  • time_getjob: time for getJob curl operation to finish.

  • time_stagein: time for entire stage-in to complete, including replica lookup. Note: the pilot cannot measure the time for direct i/o as this operation is handled by the transform.

  • time_payload: time for payload execution. Note: this includes any pre- or post-processing.

  • time_stageout: time for stage-out to complete, including log transfer.

  • time_total_setup: the total setup time is the time measured from pilot startup to the get job operation. During this time the pilot downloads queue data, checks the proxy lifetime, etc.

The timing information can be extended if necessary to measure all operations.
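The following Python sketch shows one way such timing values could be collected and condensed into a single string. It is an illustration only: the PilotTimer class is hypothetical, and both the separator and the rounding to whole seconds are assumptions, since the exact wire format is not specified on this page.

```python
import time
from contextlib import contextmanager

# Field order as listed in the format line above.
TIMING_FIELDS = (
    "time_getjob",
    "time_stagein",
    "time_payload",
    "time_stageout",
    "time_total_setup",
)


class PilotTimer:
    """Hypothetical helper that accumulates elapsed seconds per operation."""

    def __init__(self):
        self.seconds = {name: 0.0 for name in TIMING_FIELDS}

    @contextmanager
    def measure(self, name):
        """Time the enclosed block and add it to the named field."""
        if name not in self.seconds:
            raise KeyError(name)
        start = time.monotonic()
        try:
            yield
        finally:
            self.seconds[name] += time.monotonic() - start

    def timing_string(self, separator=" "):
        # Assumption: whole seconds joined by a separator, in field order.
        return separator.join(
            str(int(round(self.seconds[name]))) for name in TIMING_FIELDS
        )
```

Usage would look like "with timer.measure('time_stagein'): ..." around the stage-in call, followed by timer.timing_string() when preparing the final job update. Adding a new measurement only requires a new entry in TIMING_FIELDS.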

CPU timing measurements

The Pilot reports CPU timing information on every server update (getJob and updateJob). The measurements (system+user time for all child processes) are taken approximately once a minute while the payload is running (using /proc/pid/stat), and a final measurement is taken immediately after the payload has finished (using os.times()).

Given an initial t0, user+system time is calculated like so:

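import os
t0 = os.times()  # initial measurement, taken before the payload starts
t1 = os.times()  # final measurement, taken after the payload has finished
user_time = t1[2] - t0[2]    # index 2: children_user
system_time = t1[3] - t0[3]  # index 3: children_system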

The instant CPU timing calculation extracts the system+user time from /proc/pid/stat for a given pid and loops over the stat files of all child processes. The values in these files are given in clock ticks and are converted to seconds using os.sysconf(os.sysconf_names['SC_CLK_TCK']).

For technical details, see processes::get_cpu_consumption_time() and processes::get_instant_cpu_consumption_time().
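As a rough illustration of the instant measurement, the sketch below reads the utime and stime fields (fields 14 and 15 in the Linux proc(5) layout) from a stat line and converts them to seconds. The function names are hypothetical and this is not the code behind processes::get_instant_cpu_consumption_time(). The list of pids to inspect is assumed to be supplied by the caller.

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Return user+system CPU seconds from the content of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split after the last ')' before tokenising the rest.
    fields = stat_line.rsplit(")", 1)[1].split()
    utime_ticks = int(fields[11])  # field 14: time spent in user mode
    stime_ticks = int(fields[12])  # field 15: time spent in kernel mode
    return (utime_ticks + stime_ticks) / ticks_per_second


def instant_cpu_seconds(pids):
    """Sum user+system CPU seconds over the given process ids (Linux only)."""
    ticks_per_second = os.sysconf(os.sysconf_names["SC_CLK_TCK"])
    total = 0.0
    for pid in pids:
        try:
            with open(f"/proc/{pid}/stat") as handle:
                total += cpu_seconds_from_stat(handle.read(), ticks_per_second)
        except OSError:
            # The process may have exited between listing and reading.
            continue
    return total
```

Keeping the parsing separate from the file access makes the conversion easy to check against a known stat line without needing a running process.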

Task and Wish list

A collection + discussion of special requests, wishes, suggestions, recommendations, etc. can be found here. This list will also contain relevant feature requests for Pilot 1 that arrive during the development of Pilot 2.

Pilot Release information

Detailed release information is located in the GitHub releases tab.

Project Guidelines

Pilot 2 guidelines are located here.

Activity Reports

The activity reports show the status of the work in progress to achieve the milestones (as of end of September 2016).

Source area

The Pilot 2 development area is located in GitHub. The current production version of the PanDA Pilot is located in a different GitHub repository.

Meetings and Conferences

Links to meetings and conferences with Pilot related presentations.


Major updates:
-- PaulNilsson - 2016-03-11



Responsible: PaulNilsson

Topic revision: r54 - 2021-07-08 - PaulNilsson
 