Jobs requiring whole nodes or multiple cores

Introduction

Traditionally, WLCG experiments have run their workloads using one core per job. This normally applies to both early- and late-binding jobs. However, the experiments have for some time required the possibility of running their jobs on more than one core. See, for example, the presentation made by Ian Fisk at the December 2011 GDB [1].

The general observation is that the aggressive parameters foreseen for the LHC in 2012 will push pileup much higher than in 2011. This may lead to issues related to memory consumption; these issues can be mitigated if memory is shared across several cores on worker nodes.

Therefore, this TEG analyzed possible strategies for implementing multi-core access for jobs, taking into account requirements from both experiments and resource providers.

Whole nodes and multi-core job requirements

Currently, a few sites are running prototypes that provide multi-core support through dedicated “whole-node” queues, i.e. entry points to which experiments send jobs with the assurance that the jobs will run on hardware exclusively allocated to them. This solution has a few advantages: once a job has landed on a given node, it can simply look up how many cores and how much memory are available, and it can manage those resources using standard OS tools, without having to care about other jobs running on the same node.
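
For example, a whole-node job could discover its resources with standard tools; a minimal sketch (for illustration only) might be:

cores=$(grep -c ^processor /proc/cpuinfo)
mem_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
echo "running on $cores cores with $mem_kb kB of total memory"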

However, from a site perspective, this solution often leads to resource partitioning: some resources are allocated to one or more “whole-node” queues and cannot be used (without greatly sacrificing resource exploitation) by traditional single-core jobs. This partitioning may also be aggravated by the fact that systems have an increasing number of CPU cores; a single system therefore accounts for a relatively larger share of resources.

Several sites in general, and sites serving more than one experiment (possibly including non-LHC experiments) in particular, prefer dynamic resource sharing that makes use of the appropriate scheduling mechanisms provided by the site LRMS (Local Resource Management System, or batch system: e.g., PBS, LSF, SGE, Condor). It will in any case be a site decision whether to offer a strict "whole-node" solution or generic multi-core access.

Sites also expressed the requirement that any solution offered should not unnecessarily complicate the (often already complex) site configuration. In particular, no solution should require sites to deploy configurations that are “unique” to WLCG; on the contrary, any solution should also be applicable to other communities, should a similar need arise there.

Finally, any solution should be applicable to either early- or late-binding jobs or workloads.

Multi-core support proposal

A solution geared towards flexible exploitation of resources (by both experiments and sites), and one that does not imply resource partitioning, requires that details about the multi-core request reach the site LRMS. We propose that these details be expressed in the JDL associated with the job.

In particular, we would like to support requirements to specify:

  • the number of requested cores;
  • the requested total memory for the job (or the memory per core if the requirement on number of cores is not exactly specified);
  • whether a job wants a whole node or not.

We also foresee that jobs may only specify the minimum and/or maximum number of cores; within these limits the job will be able to run on any number of cores made available to it. How to deal with memory requirements in this case needs to be defined.
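
As an illustration only (the attribute names below are hypothetical and would have to be agreed upon; memory is expressed in MB per core), such a request could look like:

CoreCountMin = 4;
CoreCountMax = 8;
MemoryPerCore = 2000;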

There are other requirements that might be useful for jobs like, for example, the request for a minimum amount of local disk space. At this stage, however, we decided to restrict requirements to number of cores and memory.

CE vs. local policies

We note that advance reservation policies, if desired, should be implemented by sites, not by CEs. Similarly, if sites wish to enforce memory limits for jobs, these should be implemented by the sites themselves taking into account the memory requirements specified by the jobs or, in their absence, a default.

From a deployment point of view, it may be worth noting that some sites currently apply memory limits to a given queue or set of queues, assuming that jobs submitted to those queues are single-core jobs. If the same queues are also used to support multi-core jobs, the memory limits should be rescaled by the number of cores allocated.
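For example, a queue enforcing a 2 GB per-job memory limit under the single-core assumption would need to enforce an 8 GB limit for a job that has been allocated four cores.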

Multi-core support by middleware

CREAM

The CREAM CE already supports some JDL attributes allowing resource requirements to be expressed in jobs [2]. This implementation was prompted by requirements put forward by communities interested in Grid MPI support. In summary, new JDL attributes were introduced, specifying the total number of requested CPUs (CPUNumber), the minimum number of cores per node (SMPGranularity), the total number of required nodes (HostNumber), and whether a whole node (all cores) is required (WholeNodes). Some examples of the semantics and experiences are described in [3].
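
As a sketch, based on the attributes documented in [2], a JDL fragment requesting four cores on a single node could look like:

CPUNumber = 4;
SMPGranularity = 4;
HostNumber = 1;

while a whole-node request would instead set WholeNodes = true. The exact combination of attributes honoured depends on the CE and batch system configuration.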

Not all the JDL attributes mentioned above are needed to support multi-core jobs in the present context (for example, there is currently no requirement for jobs spanning multiple nodes). At the same time, there are some new attributes that should be introduced:

  • A minimum and/or maximum number of cores may be specified instead of the exact number;
  • The total requested memory (or the memory per core in case the exact number of cores is not specified) may be specified.

In particular, some hints on memory requirements should be given. We therefore propose that a new attribute be defined, specifying how much memory is required for a job.

This attribute should be translated by CREAM CEs into commands similar to:

qsub -l mem=XXXmb (PBS)
#BSUB -R "rusage[mem=XXX]" (LSF)
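
For instance, a 4-core job requesting 2000 MB per core might, as a sketch, be translated into something like:

qsub -l nodes=1:ppn=4 -l mem=8000mb (PBS)
bsub -n 4 -R "rusage[mem=8000]" (LSF)

The exact options, and whether the memory value is interpreted per job or per process, depend on the local batch system configuration.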

From an implementation point of view, the following points should be addressed by CREAM:

  • SGE support is currently missing
  • memory requirements should be supported

The CREAM developers agreed to consider implementing these changes (the manpower estimate to implement the memory requirement is about one week). It will have to be discussed with the EMI project how their existing workplan can be modified to include the new requirements, to see when and how these changes can be implemented and released.

GRAM

GRAM already supports the specification of resource requirements in jobs through the use of RSL (Resource Specification Language) [6]; indeed, it has supported this for the past 10 years or so, although the WLCG community has not used this functionality extensively. Regarding multi-core requests, the RSL attribute xcount was defined by TeraGrid [7] to support the MPI community.

In OSG, work on providing whole nodes through GRAM is documented in [4]. Please note that while the OSG activity is concentrated on "whole nodes", some of the proposed solutions are actually just "multi-core". In particular, PBS sites do not have native "whole-node" support, so the OSG team settled on "multi-core as large as the nodes" instead. For Condor and LSF, instead, true "whole-node" semantics is used.

Going back to multi-core, the RSL parameters to specify a multi-core job (e.g. 8 cores) are:

(xcount=8)(host_xcount=1)

That said, the jobmanager must be able to translate the multi-core request into the corresponding local batch system requirement:

  • As mentioned above, PBS is not a problem.
  • LSF and SGE support is currently unknown.
  • Condor for sure currently does not support it.
Nevertheless, adding support for a new feature in the GRAM jobmanager is pretty trivial, as long as the underlying batch system supports it.

That covers requests for a specific number of cores; ranges of cores are currently not supported. As above, adding support for ranges should be fairly trivial in the jobmanagers, provided the underlying batch systems support it; we would just need to agree on the syntax and semantics of the attribute(s).
Please note that interest in this functionality from OSG sites is currently non-existent.

Regarding other requirements, the attributes currently in use in OSG are the following (a combined example is sketched after this list):

  • maxWallTime
  • maxMemory
  • minMemory
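
Combining these with the multi-core attributes shown earlier, a GRAM RSL fragment might, as a sketch, read:

(xcount=8)(host_xcount=1)(maxMemory=16000)(maxWallTime=1440)

where, in typical GRAM usage, maxMemory is expressed in MB and maxWallTime in minutes; the exact interpretation is defined by the site jobmanager configuration.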

For all other needs, the following low level attributes can be used:

  • condorsubmit
  • gluerequirements

ARC and EMI

Both the ARC-extended RSL (xRSL) and the standard JSDL allow the number of cores per job to be specified.

JSDL also includes the possibility to request a whole-node allocation. This is meant to be supported not just by JSDL itself, but also by the EMI-ES compatible services, which include not just ARC but also CREAM.

In addition, the flexible runtime environment mechanism is already available and is used by multi-core jobs in a variety of MPI frameworks, though not by the WLCG community, for reasons unknown. The use of runtime environments is highly recommended.
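
For instance, an xRSL fragment requesting eight cores would, as a sketch (the executable name is just a placeholder), read:

&(executable="myjob.sh")(count=8)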

Multi-core support by experiment frameworks

Both ATLAS and CMS stated they can work on their respective frameworks to accommodate arbitrary N-core job sizes, and that the resource requirements can be specified using either JDL (for CREAM) or RSL (for GRAM).

ALICE and LHCb … [TODO: interest for multi-core support by ALICE and LHCb]

Testing

We foresee a two-phase approach for the test.

  • In the first phase, experiments will require either whole nodes or an exact number of cores for their jobs, using the JDL or RSL mechanisms specified above. To simplify things, we will initially support jobs requiring exactly four cores.

  • In the second phase, experiments may also require a variable number of cores, with the understanding that a job will be able to utilize as many cores as are made available to it.

To implement the second phase, we propose to define an environment variable, telling the job how many cores (and how much memory) the job has actually been allocated. In the case of whole nodes this is not necessary (since a job might simply query /proc/cpuinfo to get the total number of cores). However, a multi-core job may also run in a shared system; in that case a mechanism is needed to inform the job of the number of cores.
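
As a sketch (the variable name ALLOCATED_CPU is purely hypothetical and would have to be agreed upon), a job wrapper could then do something like:

# use the advertised core count if present, otherwise fall back to the whole node
if [ -n "$ALLOCATED_CPU" ]; then
  ncores=$ALLOCATED_CPU
else
  ncores=$(grep -c ^processor /proc/cpuinfo)
fi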

Neither the CREAM CE nor the GRAM-based CEs currently support the specification of a range of cores (e.g. something like "at least N cores"), so they would need to be modified to support this use case.

A similar mechanism to communicate other types of information to a job was proposed by the HEPiX-virt WG [5]; that mechanism could easily be extended to also pass the number of allocated cores and the amount of memory to jobs.

The following sites and experiments have agreed to participate in the initial testing phase:

  • CMS: modify WMAgent / glideinWMS to support multi-core scheduling according to this proposal (phase 1)
  • Potentially, all sites currently supporting the whole-node testing for CMS, plus GRIF, will join the test.

More CMS sites are expected to join.

[TODO: others sites/experiments for the testing phase?]

Other considerations

The proposal described here may interact with other subsystems or activities. We underline, in particular, the following:

Information System: for the model where jobs require an exact number of cores, we need to define a way for sites to specify the maximum number of cores that can be supported. We should probably also specify in the Information System whether a site accepts only pure whole-node requests or also generic multi-core jobs.

Virtualization: whole nodes may be provided through virtualization techniques. However, this should be transparent to users. More details about virtualization are provided in the relevant section of the activities of this TEG.

Notes

[1] CMS Ops Report, December 2011 GDB, http://indico.cern.ch/getFile.py/access?contribId=8&sessionId=3&resId=0&materialId=slides&confId=106651

[2] CREAM User's Guide, Submission on multi-core resources, http://wiki.italiangrid.it/twiki/bin/view/CREAM/UserGuide#3_1_Submission_on_multi_core_res

[3] TheoMpi: a large MPI cluster on the grid for Theoretical Physics, https://www.egi.eu/indico/getFile.py/access?contribId=65&sessionId=18&resId=0&materialId=slides&confId=207

[4] OSG High Throughput Parallel Computing, https://twiki.grid.iu.edu/bin/view/Documentation/HighThroughputParallelComputing

[5] Tony Cass, Communicating Machine Features to Batch Jobs, MB March 2011, http://indico.cern.ch/getFile.py/access?contribId=6&sessionId=1&resId=1&materialId=slides&confId=106643

[6] The Globus Resource Specification Language, http://www.globus.org/toolkit/docs/2.4/gram/rsl_spec1.html

[7] Teragrid, https://www.xsede.org/tg-archives

-- DavideSalomoni - 01-Feb-2012
