WLCG Multicore Deployment Task Force

Mandate and goals

In the scope of WLCG Operations Coordination, the Multicore Deployment Task Force pursued these objectives:

  • collect information from the experiments on their model for requesting and using multicore resources accessible at WLCG sites via traditional computing elements
  • agree on a simple but efficient (set of) multicore resource provisioning model(s) that sites can adopt with reasonable effort. Pilot the model(s) at some selected sites and evaluate them based on metrics taking into account experiment workflow efficiencies and impact on sites.
  • define a deployment strategy for the migration of an increasing amount of resources to multicore queues according to the needs and priorities of the experiments and the sites
  • drive the deployment process

In addition, the task force investigated accounting issues for multicore resources in collaboration with the Machine/Job Features Task Force.

Although the task force is not asked to provide a solution for cloud resources, the proposed solution for Grid resources has the potential to be extended to cloud resources.

In addition, the task force should ensure proper coordination with the WLCG Cloud Working Group and the WLCG Operations Coordination Machine/Job Features Task Force on aspects of its work that may overlap.

The coordination of the task force will be shared by Alessandra Forti and Antonio Pérez-Calero Yzquierdo.

Contact

All members of the task force can be contacted at wlcg-ops-coord-tf-mc@cern.ch

wlcg-ops-coord-tf-mc egroup page and membership

Task Tracking and Timeline

The task force coordination has proposed October 2014 as the target date for a fully deployed working system, so as to ensure that everything is ready for the LHC restart in 2015. The experiments involved so far, ATLAS and CMS, agree on this schedule. The table below presents a more detailed plan of tasks.

| Task name | Deadline | Progress | Affected VOs | Affected Sites | Comments |
| CMS scale tests | - | Ongoing | CMS | CMS T1s (prod) & T2s (analysis) | CMS is deploying multicore pilots for production jobs at T1s and for analysis jobs at T2s. The objective is to get CMS sites used to dealing with multicore jobs. In order to proceed incrementally, single-core and multicore pilots are submitted in parallel. |
| ATLAS deployment of a solution for passing job parameters to batch systems | - | Ongoing | ATLAS | - | - |
| Dynamic resource provisioning recommendations | - | Ongoing | CMS and ATLAS | All sites | Sites are mostly experienced with ATLAS multicore jobs so far; information related to CMS is starting to be gathered. The situation will need to be reviewed when CMS and ATLAS are both submitting multicore jobs to shared sites. |
| Collecting batch system configurations | - | Ongoing | - | - | See below. LSF/SLURM not covered yet. |

Documentation

ATLAS

CMS

CE related information

  • CREAM example from NIKHEF
  • If using a separate batch queue for multicore jobs, remember to add the queue information to the CE information system; e.g. if using YAIM, update QUEUES and the corresponding QUEUENAME_GROUP_ENABLE variable (see the configuration sketch after this list)
  • Concerning accounting:
    • Publishing multicore accounting to APEL works. ARC CEs publish correctly. For CREAM CEs, the CE must be at EMI-3 level and multicore accounting must be enabled in the configuration.
    • Edit /etc/apel/parser.cfg and set the attribute parallel=true (also illustrated in the sketch after this list).
    • Sites that were already running multicore jobs before upgrading and/or applying this modification need to reparse and republish the corrected accounting records.
    • See the slides from the APEL team at the December 10th GDB
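As a minimal sketch of the two configuration changes above: the queue name "mcore" and the VO/FQAN values are hypothetical examples, the exact file layout depends on the middleware version, and the parser.cfg section name should be checked against the APEL documentation for the installed release.

    # YAIM site-info.def: publish a hypothetical dedicated multicore queue "mcore"
    QUEUES="long mcore"
    MCORE_GROUP_ENABLE="atlas cms /atlas/ROLE=production /cms/ROLE=production"

    # /etc/apel/parser.cfg on an EMI-3 APEL node: enable parsing of multicore usage
    # (assumed to sit in the batch-parser section; verify for your APEL version)
    parallel = true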

Batch system related information

See also batch system comparison
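As an illustration only (not part of the TF documentation), a common way for an HTCondor site to serve both single-core and multicore pilots is to configure one partitionable slot per worker node; the snippet below is a minimal sketch, and production sites typically add accounting groups, memory limits and node draining (condor_defrag) on top.

    # Worker node condor_config: a single partitionable slot covering the whole
    # machine, from which 1-core and N-core jobs carve out their resources
    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%, mem=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE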

Deployment table

| Experiment | T0 | T1 | T2 | Opportunistic | Applications | Time |
| ALICE | NA | NA | NA | NA | NA | NA |
| ATLAS | yes | yes | yes | yes | reco, simul, reprocessing, pile, HLT | Running since 2014; most sites configured by March 2015; deployment tail completed in September 2015 |
| CMS | yes | yes | yes, ongoing | yes | SIM, DIGI, RECO, HLT | T1 deployment completed by May 2015; T2 deployment ongoing for 2016, as required by the experiment |
| LHCb | ? | ? | ? | ? | ? | ? |

Recurring Questions

As laid out in Thomas Hartmann's presentation:

  • If a scheduler is to utilize wall-time predictions, it needs them in HEP-SPEC06 seconds: how accurate do the HS06 values have to be for a VO / for a site?
  • HS06 scores are designed to scale with the average performance of a typical HEP job mix. Be aware that there is absolutely no guarantee that they scale with every individual job!
  • How are inefficiencies accounted for? There are no official requirements on WLCG sites.
  • Is it reasonable to charge a VO when a job's run time deviates from the prediction and spoils the scheduling?
  • Currently no wall-time prediction is provided per WLCG job, and VOs are under no obligation to supply wall-time predictions precise enough to avoid gaps or to optimize scheduling.
  • Is the cost accountable to the VO submitting a multicore job?
  • Is it solely a site issue?
  • How large is the effect in the end?
  • How many ramp-up periods, and for how long? With a steady stream of multicore jobs, does the effect become negligible after some time x?
  • System states with efficient utilization of bare metal:
    • High entropy: many(?) short (minutes, hours?) jobs filling free slots. What maximum wall-time prediction variance is needed for good scheduling? (Is a prediction needed at all?)
    • Low entropy: long (hours, days?) jobs with accurate wall-time estimation. What maximum wall-time prediction variance is needed for good scheduling?
  • Sites with mixed VO users:
    • Is a stable mixed state possible, or is implicit/explicit segmentation inevitable?
    • How efficient could an ideal scheduler be, under which variances and in which state? I.e., how large would the inefficiency in node utilization become for a given variance? If the wall-time prediction is binned, i.e. restricted to a finite number of attainable run times/queues, how does the efficiency evolve with the number of prediction time slots? (A back-of-the-envelope sketch of the draining inefficiency is given after this list.)
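To make the last question more concrete, the sketch below (not from the presentation; node size, pilot width and run-time distribution are illustrative assumptions) estimates the core-hours left idle while a node drains single-core jobs to free enough cores for one multicore pilot, assuming exponentially distributed remaining run times and no backfilling of the freed slots.

    # Rough estimate of the draining cost of starting one multicore pilot.
    # All parameters are illustrative assumptions, not TF or experiment figures.
    import random

    CORES_PER_NODE = 24        # assumed worker node size
    MCORE_WIDTH = 8            # cores requested by the multicore pilot
    MEAN_RUNTIME_H = 6.0       # assumed mean remaining run time of running 1-core jobs

    def expected_drain_waste(trials=10000):
        """Average core-hours left idle until MCORE_WIDTH cores are free on one node."""
        waste = 0.0
        for _ in range(trials):
            # remaining run times of the currently running single-core jobs
            remaining = sorted(random.expovariate(1.0 / MEAN_RUNTIME_H)
                               for _ in range(CORES_PER_NODE))
            t_free = remaining[MCORE_WIDTH - 1]  # time until MCORE_WIDTH slots are free
            # slots freed earlier than t_free sit idle until the pilot can start
            waste += sum(t_free - r for r in remaining[:MCORE_WIDTH])
        return waste / trials

    print("~%.1f idle core-hours per %d-core pilot start" % (expected_drain_waste(), MCORE_WIDTH))

Under these particular assumptions the waste comes out around ten idle core-hours per pilot start; with a steady stream of multicore pilots reusing the freed slots this is mostly a one-off ramp-up cost, which is exactly the point of the questions above.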

Task Force and related meetings

Additional sources of related material

-- SimoneCampana - 27 Nov 2013
