Commonalities in pilot frameworks

Present status

The four LHC experiments have each independently developed or integrated a pilot framework, with the following main intents:

1) implement late binding, which in turn addresses two issues:

  1. it avoids the frustrating attempts to deal with black-hole worker nodes and with the inefficiencies of the Information System, part of the standard grid middleware stack;
  2. it enables intra-VO priority throttling;

2) overcome the failures of the early versions of the middleware itself;

3) implement a centralized workload management system able to guarantee VO-wise fair sharing.
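
As an illustration of the late-binding idea listed above, the following is a minimal sketch and not the implementation of any of the four frameworks. The names TaskQueue, run_pilot and node_is_healthy are hypothetical, and the convention that a lower number means a higher priority is an assumption of the sketch. A payload is bound to a worker node only when a pilot that has already validated the node asks for work, so a black-hole node never receives a user payload, and the central queue decides the order within the VO.

```python
import heapq
from itertools import count


class TaskQueue:
    """Hypothetical central VO queue: payloads are bound only when a pilot asks."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker keeps FIFO order within one priority

    def add(self, payload, priority):
        # Assumption of this sketch: lower number = higher priority.
        heapq.heappush(self._heap, (priority, next(self._order), payload))

    def next_payload(self):
        # Return the highest-priority payload, or None when the queue is empty.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def run_pilot(queue, node_is_healthy):
    """A pilot checks its worker node first and pulls work only if the node is usable."""
    if not node_is_healthy:
        return None  # black-hole node: the pilot exits and no user payload is lost
    return queue.next_payload()


queue = TaskQueue()
queue.add("user-analysis", priority=5)
queue.add("production", priority=1)

first = run_pilot(queue, node_is_healthy=False)   # pilot landed on a bad node
second = run_pilot(queue, node_is_healthy=True)   # gets the highest-priority payload
third = run_pilot(queue, node_is_healthy=True)    # gets the remaining payload
```

In this sketch the pilot on the unhealthy node returns without touching the queue, which is the property that early binding cannot offer: with early binding the payload would already have been sent to that node.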

Beyond mere workload management, the frameworks were also created to manage high-level user workflows and to have direct control over how they are translated into grid jobs, how data and software are distributed, and so on. Some frameworks can also use early binding, so they are not strictly tied to pilots. In any case, this upper layer is more experiment-specific, and none of the experiments showed interest in finding common solutions for it.

Back to workload management: the present grid middleware does not support late binding, so it is currently implemented on top of the classic one-way, push-mode submission model that the middleware provides (pilots are sent either directly to the CEs or through a grid submission layer that abstracts the CEs), with an extra encapsulation (the pilot itself) that provides an overlay network of Grid nodes. This approach is not seen as problematic in itself, but it can sometimes put too much pressure on the infrastructure. All frameworks agree that this can be optimized by having the CE support streamed requests of pilot jobs (see Streamed Submission).

There are, however, real problems with this approach, and they all concern its security implications, which another, similar group is trying to work out. The fundamental issue is that the identity of the job that authenticates is not the identity of the job that runs. This requires an identity switch (reasonable in itself, but at the moment the site has no way to control or enforce it) and, in some cases, the handling of end-user credentials directly on the WN. So far these security implications have been overlooked, mostly because the advantages of late binding still outweigh its disadvantages and no better solution has been proposed.

Commonalities in the Grid submission layer

While CMS has in glideinWMS a solution that relies fully on Condor, PANDA uses Condor essentially to abstract the Grid submission layer, and hence the various CE implementations. DIRAC and ALIEN have developed their own workload management solutions, which can submit directly to the CREAM CE or go through the gLite WMS. During the meeting, ATLAS communicated its intention to use glideinWMS to handle Grid job submission and to deal with glexec, so that it does not have to handle these directly.

Streamed Submission

All VOs expressed interest in an extension of the CE interface that would constantly keep N jobs queued in a given queue until a given condition is satisfied (e.g. there is no more work to be done). This would avoid the need for a VO pilot factory at the sites, which some experiments had initially proposed in order to avoid too many authN/authZ operations on the CE and to make pilot execution more efficient. VO-specific pilot factories were considered difficult to sustain, especially for multi-VO sites. Even for the experiments that did not propose one, a pilot stream would be beneficial, because it decreases the 'pressure' from the framework on the sites. Like any other submission request, a 'job stream' request will have a JDL expressing the requirements, shared by every job it produces, and associated operations to list, kill and even update the submission frequency. Care needs to be taken not to overflow the batch system queues, especially when there is little or no work to perform on the framework side; this requires dynamic control over the job streams. The CREAM developers considered this feasible. The TEG approved this activity.
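
The control logic of such a job stream can be sketched as follows. This is only an illustration under stated assumptions, not the CREAM interface: the class JobStream, its fields and its methods are hypothetical names, the JDL is an opaque placeholder string, and the "update" operation is modelled here as changing the target number of queued pilots rather than a submission frequency.

```python
from dataclasses import dataclass


@dataclass
class JobStream:
    """Hypothetical stream request: one JDL shared by every pilot it produces."""
    jdl: str                 # placeholder; no real JDL syntax is assumed here
    target_queued: int       # N pilots to keep queued
    active: bool = True

    def pilots_to_submit(self, currently_queued, work_available):
        """Return how many pilots to add to the queue in this cycle."""
        if not self.active or not work_available:
            return 0  # do not fill the batch queue when there is nothing to run
        return max(0, self.target_queued - currently_queued)

    def update(self, target_queued):
        """Dynamic control: change the number of pilots kept queued."""
        self.target_queued = target_queued

    def kill(self):
        """Stop the stream; no further pilots are produced."""
        self.active = False


stream = JobStream(jdl="<shared pilot JDL>", target_queued=10)
top_up = stream.pilots_to_submit(currently_queued=4, work_available=True)
idle = stream.pilots_to_submit(currently_queued=0, work_available=False)
```

With 4 pilots already queued and a target of 10, the stream asks for 6 more; when the framework reports no work, it asks for none, which is the behaviour needed to avoid overflowing the batch system queues.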

Common job wrapper for pilots

As noted above, there are currently potential issues in the way the pilot wrapper retrieves the payload user's credentials and calls glexec or a similar mechanism to switch identity and execute the user payload under the user's UID. Because these steps are critical, each and every framework has to be reviewed whenever something is changed. For these reasons, it was proposed to find commonalities in the pilot bootstrap scripts, in order to factor out this part and, in particular, to enforce the identity switch. This has to be investigated further before decisions are taken.
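
One way to picture the proposed factoring is a single shared step that every framework calls and that refuses to run a payload under the pilot identity. The sketch below is purely illustrative: run_payload, IdentitySwitchError and the two callables are hypothetical names, and the actual mechanism (glexec or similar) is deliberately hidden behind the switch_identity argument rather than invoked.

```python
class IdentitySwitchError(RuntimeError):
    """Raised when the payload would run under the pilot identity."""


def run_payload(pilot_uid, payload_owner, switch_identity, execute):
    """Common step shared by all frameworks (hypothetical interface).

    switch_identity: callable supplied by the site or framework (for example
        a wrapper around glexec) that maps the payload owner to the UID the
        payload will run under.
    execute: callable that runs the payload under the given UID.
    """
    target_uid = switch_identity(payload_owner)
    if target_uid is None or target_uid == pilot_uid:
        # Enforce the switch: never fall back to the pilot's own identity.
        raise IdentitySwitchError("payload must not run under the pilot UID")
    return execute(target_uid)


# Stand-ins for illustration only: a fixed owner-to-UID mapping and a fake runner.
uid_map = {"alice": 2001}
result = run_payload(
    pilot_uid=1000,
    payload_owner="alice",
    switch_identity=uid_map.get,
    execute=lambda uid: f"ran as {uid}",
)
```

Keeping the check in one shared function means that only this function, rather than each framework's bootstrap script, would need a security review when something changes.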

Topic revision: r11 - 2012-02-27 - ClaudioGrandi
 