Motivation
The LCG RB (both the EDG and gLite WMS) limits the size of a job's inputsandbox (10 MB by default). Jobs with an oversized inputsandbox fail at submission. The current workaround for ATLAS is to configure the RBs to accept an inputsandbox of up to 50 MB.
To overcome this limitation, the concept of an inputsandbox cache is introduced in the LCG backend handler. The idea is to pre-upload oversized inputsandboxes to SEs when preparing the job, and to have the job wrapper download them on the fly before launching the real executable on the worker node.
Implementation
A loop has been added to the jobprepare() method of the LCG handler. In the loop, the size of each input sandbox is checked before composing the job wrapper and JDL.
If the sandbox doesn't exceed the limit defined by config['LCG']['BoundSandboxLimit'], it is attached to the job and shipped through the resource broker; otherwise the sandbox is uploaded to a remote storage element before job submission. In the latter case, a reference to the pre-uploaded sandbox (e.g. GUID or URI) is given to the job. The reference is also logged in job.inputdir/__iocache__ for other management purposes (e.g. cleanup). If an oversized sandbox is shared among sub-jobs, it is uploaded only once.
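The decision made in this loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the handler's actual code: the function plan_sandboxes, the upload callable and the file names are hypothetical. Only the size comparison against the configured limit and the upload-once behaviour are taken from the description above.

```python
import os
import tempfile


def plan_sandboxes(paths, limit_bytes, upload):
    """Split sandbox files into those shipped via the RB and those pre-uploaded.

    `upload` is a hypothetical callable taking a local path and returning a
    reference (e.g. a GUID or URI) to the uploaded copy.
    """
    attached = []      # small files: travel with the job through the RB
    references = {}    # oversized files: local path -> remote reference
    for path in paths:
        if os.path.getsize(path) <= limit_bytes:
            if path not in attached:
                attached.append(path)
        elif path not in references:
            # A sandbox shared among sub-jobs is uploaded only once.
            references[path] = upload(path)
    return attached, references


# Usage with a fake uploader that only records what it was asked to send.
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.tgz")
    big = os.path.join(tmp, "big.tgz")
    with open(small, "wb") as f:
        f.write(b"x" * 10)
    with open(big, "wb") as f:
        f.write(b"x" * 1000)

    uploaded = []

    def fake_upload(path):
        uploaded.append(path)
        return "guid:" + os.path.basename(path)

    # `big` appears twice, as if it were shared by two sub-jobs.
    attached, references = plan_sandboxes([small, big, big], 100, fake_upload)
```

Keeping the references in a dictionary keyed by local path is one simple way to get the upload-once behaviour; the real handler may track shared sandboxes differently.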
For transferring the oversized input sandboxes, a GridCache class is introduced to provide three basic methods: upload(), download() and delete(). All the methods are implemented with a retry mechanism. The current implementation wraps the lcg-utility commands to perform file transfers on the grid.
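As a rough picture of how three such methods can share one retry mechanism, here is a sketch. It is not the real GridCache: the class name GridCacheSketch, the transfer backend object, and the max_retries and delay parameters are all assumptions made for illustration, and the fake backend replaces the grid commands so that the example runs anywhere.

```python
import time


class GridCacheSketch:
    """Illustrative stand-in for a GridCache-like class (not the real one).

    `transfer` is a hypothetical backend object with upload/download/delete
    methods that raise an exception on failure.
    """

    def __init__(self, transfer, max_retries=3, delay=0.0):
        self.transfer = transfer
        self.max_retries = max_retries
        self.delay = delay

    def _retry(self, operation, *args):
        # Call the operation until it succeeds or the attempts run out.
        last_error = None
        for attempt in range(1, self.max_retries + 1):
            try:
                return operation(*args)
            except Exception as err:
                last_error = err
                if attempt < self.max_retries:
                    time.sleep(self.delay)
        raise RuntimeError(
            "gave up after %d attempts" % self.max_retries
        ) from last_error

    def upload(self, local_path):
        return self._retry(self.transfer.upload, local_path)

    def download(self, reference, local_path):
        return self._retry(self.transfer.download, reference, local_path)

    def delete(self, reference):
        return self._retry(self.transfer.delete, reference)


class FlakyTransfer:
    """Fake backend whose upload fails twice before succeeding."""

    def __init__(self):
        self.upload_calls = 0
        self.deleted = []

    def upload(self, local_path):
        self.upload_calls += 1
        if self.upload_calls < 3:
            raise IOError("transient failure")
        return "guid:" + local_path

    def download(self, reference, local_path):
        return local_path

    def delete(self, reference):
        self.deleted.append(reference)
        return True


backend = FlakyTransfer()
cache = GridCacheSketch(backend, max_retries=3)
reference = cache.upload("sandbox.tgz")  # succeeds on the third attempt
```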
Usage
Although this feature is applied automatically when the LCG handler detects an oversized input sandbox, there are ways to configure (or disable) it:
- to force the feature on or off: set config['LCG']['BoundSandboxLimit'] to a very small or very large value (in bytes)
- to force the use of a certain storage element: set config['LCG']['DefaultSE'] or j.backend.iocache (j.backend.iocache takes precedence)
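The effect of these two settings can be modelled in a few lines. The sketch below uses a plain dictionary in place of the Ganga configuration object, and the helper names uses_cache and choose_storage_element, the host names and the numeric values are hypothetical; it only illustrates the threshold and the precedence rule stated above.

```python
def uses_cache(sandbox_size, config):
    """True if a sandbox of this size (in bytes) would be pre-uploaded."""
    return sandbox_size > config["LCG"]["BoundSandboxLimit"]


def choose_storage_element(backend_iocache, config):
    """Return the SE to use: the per-job setting wins over the default."""
    if backend_iocache:
        return backend_iocache
    return config["LCG"].get("DefaultSE") or None


# Plain dict standing in for the Ganga configuration (hypothetical values).
config = {
    "LCG": {
        "BoundSandboxLimit": 10 * 1024 * 1024,
        "DefaultSE": "se-default.example.org",
    }
}

one_mb = 1024 * 1024
normal = uses_cache(one_mb, config)            # below the limit

config["LCG"]["BoundSandboxLimit"] = 1         # very small: force the cache
forced_on = uses_cache(one_mb, config)

config["LCG"]["BoundSandboxLimit"] = 10 ** 15  # very large: disable it
forced_off = uses_cache(one_mb, config)

per_job = choose_storage_element("se-job.example.org", config)
fallback = choose_storage_element("", config)
```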
As the current implementation uses LFC, the LFC_HOST environment variable is detected automatically using the lcg-infosites command. config['LCG']['DefaultLFC'] can be given as a backup setting in case LFC_HOST cannot be obtained. The value used to upload the oversized input sandbox is set as an environment variable in the JDL, to ensure that the WN uses the same LFC to download the sandbox.
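The fallback order described here can be sketched as below. The detect callable is a hypothetical stand-in for the automatic lookup (no actual command is invoked), the function names and host names are illustrative, and the formatting of the JDL Environment attribute is an assumption about how the value might be passed rather than a copy of the handler's output.

```python
def resolve_lfc_host(detect, config):
    """Return the detected LFC host, or config['LCG']['DefaultLFC'] as backup.

    `detect` is a hypothetical callable standing in for the automatic
    lookup; it returns a host name, or None or '' when nothing is found.
    """
    try:
        host = detect()
    except Exception:
        host = None
    if host:
        return host
    backup = config["LCG"].get("DefaultLFC")
    if not backup:
        raise RuntimeError("LFC_HOST could not be determined")
    return backup


def jdl_environment(lfc_host):
    """Format a JDL Environment attribute carrying LFC_HOST to the WN."""
    return 'Environment = {"LFC_HOST=%s"};' % lfc_host


# Plain dict standing in for the Ganga configuration (hypothetical host).
config = {"LCG": {"DefaultLFC": "lfc-backup.example.org"}}

detected = resolve_lfc_host(lambda: "lfc.example.org", config)
fallback = resolve_lfc_host(lambda: None, config)
jdl_line = jdl_environment(fallback)
```

The point of passing the resolved value through the JDL is that upload and download agree on one catalogue, whichever of the two sources it came from.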
In addition, a method for cleaning up the uploaded files is exposed to users. For a job in a final state (e.g. completed, failed), one can call j.backend.cleanup() to manually remove all the uploaded input sandboxes associated with the job. For the moment, the LCG handler doesn't clean them up automatically.
SRMv2 space token
As SRMv2 is now adopted by HEP experiments for managing the data stored on distributed storage elements, a specific SRMv2 space token needs to be specified when uploading an oversized input sandbox to avoid misusing the storage technology (e.g. you don't want the input sandbox to be staged to tape). Starting from Ganga 4.4.10, the space token can be specified in either of the following two ways:
- set config['LCG']['DefaultSE'] using the syntax token:<TOKEN_NAME>:<SE_NAME>
- or set config['LCG']['DefaultSRMToken']
If both are set, the first setting takes precedence.
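One way to read the two settings together is sketched below. The function resolve_space_token, the token names and the host name are hypothetical, and the parsing is an assumption based only on the token:<TOKEN_NAME>:<SE_NAME> syntax given above; the actual handler may parse the value differently.

```python
def resolve_space_token(config):
    """Return (space_token, se_name) from the LCG config section.

    A DefaultSE written as token:<TOKEN_NAME>:<SE_NAME> takes precedence
    over DefaultSRMToken. Either element may be None if not configured.
    """
    lcg = config["LCG"]
    default_se = lcg.get("DefaultSE") or ""
    if default_se.startswith("token:"):
        # Split at most twice so that an SE name containing ':' stays whole.
        parts = default_se.split(":", 2)
        if len(parts) != 3 or not parts[1] or not parts[2]:
            raise ValueError(
                "expected token:<TOKEN_NAME>:<SE_NAME>, got %r" % default_se
            )
        return parts[1], parts[2]
    return lcg.get("DefaultSRMToken") or None, default_se or None


# Both settings present: the token embedded in DefaultSE wins.
both = resolve_space_token({
    "LCG": {
        "DefaultSE": "token:TOKEN_A:se.example.org",
        "DefaultSRMToken": "TOKEN_B",
    }
})

# Only DefaultSRMToken carries a token; DefaultSE is a plain SE name.
token_only = resolve_space_token({
    "LCG": {"DefaultSE": "se.example.org", "DefaultSRMToken": "TOKEN_B"}
})
```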
Known issues
* For gLite bulk jobs, the WMS applies the sandbox restriction to the job collection, whereas the LCG handler checks the sandbox size only for each individual job.
-- Main.hclee - 24 Jan 2007