CMS Multicore job scheduling project

Goal

In order to prepare for the second LHC run, CMS software developers are adapting our reconstruction and simulation code so that jobs may be run as multithreaded applications. The main gain from multithreading should be the reduced memory consumption, which would otherwise be a limiting factor given the increased event complexity. It also provides advantages in the scalability of the workflow management and pilot submission systems. In order to be able to schedule this new type of job, CMS will be using multicore pilots with dynamic partitioning to allocate jobs (single core and multicore) at the T0, T1s and T2s, in each of their respective use cases.
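To make the dynamic partitioning idea more concrete, here is a minimal toy sketch in Python (not CMS or glideinWMS code; the 8-core pilot size and the job mix are illustrative assumptions): the cores held by one multicore pilot are carved into slots on demand, serving both multicore and single core payloads.

    # Toy illustration of dynamic partitioning inside one multicore pilot.
    # Not CMS/glideinWMS code: pilot size and job mix are assumptions.

    PILOT_CORES = 8

    def allocate(free_cores, job_queue):
        """Greedily assign queued jobs (each requesting some number of
        cores) to the remaining free cores of a single pilot."""
        running = []
        for cores_requested in job_queue:
            if cores_requested <= free_cores:
                running.append(cores_requested)
                free_cores -= cores_requested
        return running, free_cores

    # A mixed queue: one 4-core job followed by single core jobs.
    queue = [4, 1, 1, 1, 1, 1]
    running, idle = allocate(PILOT_CORES, queue)
    print("running jobs (cores each):", running, "idle cores:", idle)
    # -> running jobs (cores each): [4, 1, 1, 1, 1] idle cores: 0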

The initial target timeline for the project included the use of multicore pilots to manage 100% of the resources at the sites, starting with the T0 and T1s in 2015. The project now aims at providing the full required functionality, namely access to the computing resources at the required scale and the corresponding monitoring tools, as well as full deployment to all supporting sites during 2016.

Advantages of a 100% multicore pilot model

-Scheduling prioritization: CMS would gain full control over the scheduling priorities of single and multicore jobs, from production to analysis, in line with the move of the submission infrastructure to the unified Global Pool. The argument is that the CMS glideinWMS Global Pool is the best place to prioritize workflows and optimize the use of the resources according to CMS needs.

-Avoid single and multicore pilot competition for resources at the LRMS: since both single and multicore pilots are submitted under the same user, sites are not able to provide a fair-share-based allocation to each pilot category. As single core pilots are naturally easier to schedule, they would tend to exhaust the global CMS share at the site, so little or no multicore resource allocation would be achieved.

-Avoid single and multicore pilot competition for workload in the glideinWMS negotiator: such competition effectively reduces the overall occupancy of the slots in multicore pilots, unless the system is running in saturation mode.

-Reduced slot negotiation times: these follow from the reduction in the number of pilots needed to control the whole system, which improves scalability by decreasing the number of jobs and pilots to be handled.

Challenges of the multicore pilot model

-CPU wastage in draining pilots: if multicore pilots are used to allocate resources to multiple single core jobs (multicore pilots are strictly needed only for multicore jobs), some CPU time is wasted when pilots drain their resources before their lifetime ends; hence this draining period must be minimized (a toy estimate of the wasted CPU is sketched after this list).

-Slower ramp up compared to single core pilots: increasing the level of multicore allocated resources at any site might be much slower than in the case of single core slots, as sites need to create multicore slots by draining their WNs. Typically a site protects its farm by not draining more than a few percent of the resources at any given time, to avoid low average farm occupancy, which can make the ramp up of available multicore slots slow.

-Fragmentation: mixing single core and multicore payloads in the same pilots fragments the internal slots; this must be avoided in order to keep enough resources available as multicore slots and prevent them from being taken over by single core jobs. The problem of fragmentation should be solved by the renewal of pilots of finite lifetime: as new pilots start with non-fragmented resources, they should pull multicore jobs first, so no additional defragmentation should be needed.
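As a back-of-the-envelope illustration of the draining cost discussed above, the toy calculation below uses purely assumed numbers (an 8-core pilot whose internal jobs end at different times before the pilot itself terminates) to estimate the wasted core-hours, i.e. the idle time accumulated by each core between the end of its last job and the pilot shutdown:

    # Toy estimate of CPU wasted while a multicore pilot drains.
    # All numbers are illustrative assumptions, not measured CMS values.

    # Hour at which each of the 8 cores runs out of work.
    last_job_end = [20.0, 21.5, 22.0, 22.5, 23.0, 23.5, 23.5, 24.0]
    pilot_end = 24.0                  # hour at which the pilot terminates
    pilot_cores = len(last_job_end)

    wasted_core_hours = sum(pilot_end - t for t in last_job_end)
    allocated_core_hours = pilot_cores * pilot_end
    waste_fraction = wasted_core_hours / allocated_core_hours

    print("wasted: %.1f core-hours (%.1f%% of the allocation)"
          % (wasted_core_hours, 100 * waste_fraction))
    # -> wasted: 12.0 core-hours (6.2% of the allocation)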

Validation of the model

The validity of the model should be tested by measuring CPU usage and controlling the fragmentation of resources. Some fallback strategies include:

- Using fixed-slot multicore pilots (a fixed fraction of the cores assigned to multicore jobs and the rest to single core jobs). This would solve the fragmentation problem, but would still suffer from draining.

- Dynamically induced defragmentation of resources during pilot lifetime, to increase turnover of internal slots from single core to multicore jobs.

- Using separate single core and multicore pilots for each type of job, as ATLAS does. This would solve both the fragmentation and draining problems; however, the CMS infrastructure would not get the benefit of a reduced number of pilots.

Deployment at Tier2s

The priority for provisioning multicore resources during 2015 was the T1s, which were expected to run PromptReco in multithreaded mode in support of the T0. At the T2 level, the deployment has become urgent in 2016. The transition of Monte Carlo production to multithreaded mode is foreseen to proceed in two steps, which also drives the current plan for deployment to T2s during 2016 (see slides):

-DIGI-RECO should be ready by Spring for the pre-ICHEP campaign (April to June). T2 sites producing DR should be moved to multicore first. This would allow CMS to double its available multicore resources with respect to the T1s alone.

-GEN-SIM: to be produced by multithreaded jobs for the post-ICHEP campaign. The rest of the T2s should be ready by October 2016.

Instructions for sites on the deployment of multicore resources at T2s.

The WLCG Multicore Deployment Task Force

The WLCG multicore deployment TF is the forum for interaction with the sites and also with the other main user, ATLAS. The main topic is understanding how the VOs' workload submission patterns interact with local job scheduling algorithms, in order to make the most efficient use of the available computing resources. We are participating in this group, although our impact has been and will continue to be quite limited until we really start the step 2 tests, running workflows in real-life conditions at some sites, preferably sites shared with ATLAS.

We should also take feedback from sites in order to tune our system for the best results. For some preliminary thoughts on how to tune our job submission system see https://indico.cern.ch/event/306792/contribution/6/material/slides/0.pdf

Currently open issues and action items (Feb. 2016)

CMSSW

-Understand which types of jobs will be run multithreaded and when. For 2016, the strategy has been defined in terms of running data (RECO) and MC (DIGI+RECO) reconstruction in multithreaded mode, starting with the 2016 data taking and the MC production campaigns before ICHEP. Post-ICHEP, MC production (GEN-SIM) should also be moved to multithreaded mode. This drives the need for multicore resources available at T2s.

Multicore job injection

-To validate the performance of the "100% multicore pilot" model, in particular with regards to fragmentation, we need to load the system with multicore jobs routinely. In 2015, the only experience was in mid March, when T0 PromptReco jobs were allocated to T1s as an exercise. However, from December 2015, the reprocessing of the 2015 datasets was performed in multithreaded mode, with apparently successful results.

GlideinWMS: FE, Factories, Negotiator

-Controlling pilot pressure: review front-end and factory configurations to avoid excessive or insufficient pilot load at sites. On the factory side, given a site's pledge, "PerEntryMaxGlideins" should be set keeping in mind the 8-core-per-pilot factor, as well as the number of available entries per site and the number of running factories (a rough example of this bookkeeping is sketched after this list).

-Depth-wise filling of the multicore pilots: concentrate the load in the minimum number of glideins (pilots) to avoid CPU wastage when not running in saturation mode.

-Learn to configure the multicore pilot lifetime parameters (glidein max wall time, glidein retire time, glidein retire time spread, job walltime, claim worklife, job ranks, etc.) to tune their performance in terms of efficient CPU usage.

-Understand whether multicore pilots are needed with the "pilot role" or not while we keep the 95% prod / 5% pilot resource share policy for T1s.
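As a rough, hedged illustration of the pilot pressure bookkeeping referred to in the first item of this list, the sketch below derives a per-entry glidein limit (in the spirit of "PerEntryMaxGlideins") from a site's pledged cores, the 8-core-per-pilot factor, and the number of entries and factories. All input numbers are hypothetical; the actual values would be tuned by the submission infrastructure operators.

    # Back-of-the-envelope estimate of a per-entry glidein limit
    # (in the spirit of PerEntryMaxGlideins). All inputs are hypothetical.

    import math

    pledged_cores = 4000      # example CMS pledge at a site, in cores
    cores_per_pilot = 8       # multicore pilot size
    entries_for_site = 2      # factory entries pointing at this site
    active_factories = 3      # factories submitting to those entries

    max_pilots_at_site = pledged_cores // cores_per_pilot
    # Split the site-wide pilot budget across entries and factories,
    # since each factory applies its limit independently per entry.
    per_entry_limit = math.ceil(max_pilots_at_site /
                                (entries_for_site * active_factories))

    print("site-wide pilot budget:", max_pilots_at_site,
          "per-entry limit per factory:", per_entry_limit)
    # -> site-wide pilot budget: 500 per-entry limit per factory: 84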

Workload Management

-Can we switch the multicore requirement on and off for the different tasks in a task chain (GEN-SIM-DIGI-RECO) if only some of them can be efficiently run multithreaded? The trade-off between CPU inefficiency and the increased overheads plus additional scheduling complexity needs to be evaluated.

Dashboard monitoring

-Use the number of cores to correctly compute the CPU efficiency and classify jobs. This involves both the historical view and the interactive view; both are being worked on (https://cern.service-now.com/service-portal/view-request.do?n=RQF0548540).

Pilot monitoring deployment

An important part of the project is the development of the appropriate monitoring tools, for the CMS central submission infrastructure but also from the sites' point of view. These tools should be used to identify sources of scheduling inefficiency in the pilots. Central monitoring is being developed at:

Also:

  • sites: additional pilot performance monitoring from the site side is definitely welcome (work ongoing at PIC, KIT and FNAL)

  • dashboard: the number of cores used by each job is reported to the dashboard. We can then filter jobs by that category (single core vs. multicore), as well as scale the walltime with NCores and calculate correct efficiency values in the 0-1 range (see the sketch below).
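A minimal sketch of that efficiency calculation (the function and argument names are assumptions for illustration, not the actual dashboard implementation):

    # CPU efficiency of a (possibly multithreaded) job: total CPU time
    # divided by walltime scaled with the number of cores, clamped to [0, 1].
    # Names are illustrative; the dashboard implementation may differ.

    def cpu_efficiency(cpu_time_s, walltime_s, n_cores):
        if walltime_s <= 0 or n_cores <= 0:
            return 0.0
        return min(1.0, cpu_time_s / (walltime_s * n_cores))

    # Example: a 4-core job using 11 hours of CPU over 3.5 hours of walltime.
    print(round(cpu_efficiency(11 * 3600, 3.5 * 3600, 4), 2))   # -> 0.79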

Activities, Steps and timeline (in reverse order)

See activities and timeline for the project

Related information:

The status of the project, along with technical matters, is frequently discussed in the CMS Submission Infrastructure meetings (Thu. 5 pm CERN time). For site-related topics and the interaction with ATLAS, the usual forums are the WLCG multicore deployment task force meetings.

News in the Computing Project Office

Slides related to the project:

2016:

2015:

2014:

2013:

Additional documentation:

Twikis for CMS GlideinWMS configuration:
* https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsGlideinWMSUsage
* https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsWMSDeploy

WLCG Multicore deployment task force:
* https://twiki.cern.ch/twiki/bin/view/LCG/DeployMultiCore

List of CMS computing projects:
* https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompProjOffice

GlideinWMS website:
* http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html

-- AntonioPerezCalero - 18 Mar 2014
