Summary of pre-GDB meeting on Batch Systems, March 11, 2014 (CERN)


Introduction - M. Jouvin

Batch system landscape in our community has diversified

  • Grid Engine adopted by several T1s and large T2s
  • SLURM and HTCondor emerged as alternatives to Torque/MAUI and LSF

New challenges:

  • Multicore job support is a high priority
  • Ability to add/remove nodes from cluster seamlessly (for clouds)

Meeting goals:

  • Reports on sites' experience
  • Identify strengths and weaknesses
  • Review missing bits for job submission, mgmt, accounting, monitoring

Batch System Review


NIKHEF + PIC experience - C. Acosta (PIC)

  • Torque/MAUI has usual batch system features, including fairshare, backfilling, SMP and MPI
  • Capable of handling multi-core jobs
  • Both sites running Torque 2.5.13 and MAUI 3.3.4
  • Both sites are multi-VO
    • NIKHEF: WLCG is only 50% of the users, 3800 job slots running, 2000 jobs waiting, 97% average usage in last 12 months
    • PIC: mostly WLCG, 3500 job slots, 2550 waiting jobs, 95% average usage in last 12 months
  • Torque has a very active community + free support from Adaptive Computing
    • At least one release per year, frequent patches for critical problems
    • Easy to configure
    • 2.5.x has been EOL for quite some time: the recommended version is 4.2
    • Scalability issues reported for 2.5.x, but not at our scale: 4.2.x should bring significant scalability enhancements
  • MAUI: no longer supported by Adaptive Computing, poor documentation, clumsy error messages
    • Scalability issues: basically hangs at 8K jobs. Setting MAXIJOBS (max number of jobs to consider per scheduling pass) may help.
    • Non trivial configuration
    • Moab is the commercial alternative intended to address scalability issues: good feedback from sites which used/tried it
  • Torque/MAUI scalability issues only relevant for large sites and may be alleviated by moving to multicore jobs (fewer jobs to handle)
  • Possible options for the future
    • Move to Moab but not free!
    • Start an openMAUI project
    • Change to another system, but the benefit is not really clear: would require training of local users
    • PIC and NIKHEF have not yet faced major issues
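The MAXIJOBS tuning mentioned above is a small change in maui.cfg; a sketch only (the parameter value is illustrative, and names should be checked against the Maui documentation for the deployed version):

```
# maui.cfg fragment -- cap the number of idle jobs Maui considers per
# scheduling pass, which reportedly helps when the queue grows towards 8K jobs
MAXIJOBS        4096
# poll the resource manager less aggressively to reduce scheduler load
RMPOLLINTERVAL  00:02:00
```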


  • Several concerns about the impact of running unsupported SW with respect to policies and best practices
    • NIKHEF: (EGI) security experts on site never complained about running MAUI; when there is a security issue, fix it or mitigate it
    • EGI (Peter) agrees that this is a concern that needs to be addressed at some point, even if there is no urgency. NIKHEF has no commitment to do it for the community, even though it tends to do it.
  • MAUI scalability can be greatly affected by scheduling parameters chosen
  • Torque 2.5 has a problem with WNs that die: this can kill Torque.
  • Why not run a recent Torque (4.2.x)? Jeff reports that no major problems are known; it is mainly a matter of configuring munge
    • Probably should be done, as the EPEL 2.5 package is unmaintained (not only Torque itself)
  • Possibility to start a community around MAUI: unclear, not before a major issue requires it, depends on the features provided by MAUI alternatives
    • NIKHEF not yet convinced that other batch systems provide the same set of advanced features, or with better stability
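The munge configuration mentioned above is the main new step when moving to Torque 4.2.x; a rough sketch (package and service names vary by distribution, and Torque must be built with munge support):

```
# on the Torque server: generate a shared munge key (munge package required)
create-munge-key
service munge start
# then copy /etc/munge/munge.key to every WN and CE,
# start the munge service there as well,
# and deploy Torque 4.2.x configured for munge authentication
```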

Grid Engine

GridKa experience - M. Alef

  • Running Univa GE since May 2012
  • Certificate Security Protocol enabled
  • Single queue for all VOs and all types of jobs: simplification compared to openPBS configuration
  • Fairshare configured according to pledges, based on WC time. GE supports several fairshare strategies: traditional, functional (no history), override (high-priority jobs)
  • Scheduler fast; very good support
  • Documentation tends to be problematic: a lot of developers' slang...
  • Command-line syntax is not always optimal, with some options doing different things for the same command depending on the context
  • Multi-core job support enabled: used in production by ATLAS for several weeks
    • No dedicated queue: job requests the number of cores in the JDL
    • No subcluster
    • Parallel Environment configured to support multi-core jobs
    • Memory limits set by profile scripts on the WN, according to the number of cores requested
    • Backfilling enabled but very few jobs declare their max wall time
    • Use of reservations to help with multi-core job scheduling: degradation of up to 1% of cluster usage
      • For an 8-core job, this results in up to 7 single-core job slots sitting idle
      • Have to enable the use of reservations by multi-core jobs through a cron job running qalter: not done by default by CREAM
    • Multi-core job reservations are used by single-core jobs if there are no more multi-core jobs
    • Observed wavelike submission of multicore jobs
  • Accounting: qacct doesn't scale WC time by the number of cores
    • Not a problem for accounting but for fairshare config based on WC time
    • J. Gordon: debatable whether WC time must be scaled by the number of cores...
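As a sketch of the setup described above (the PE name is hypothetical; the actual GridKa configuration was not shown): a Grid Engine parallel environment keeping all requested cores on one host, and the cron-driven qalter that enables reservations for multi-core jobs:

```
# hypothetical PE definition (edited via qconf -ap mcore);
# $pe_slots forces all requested slots onto a single host
pe_name            mcore
slots              99999
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE

# a job then requests 8 cores through the PE, e.g.:
#   qsub -pe mcore 8 job.sh
# and a cron job flags pending multi-core jobs for reservation,
# since CREAM does not set this by default:
#   qalter -R y <job_id>
```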


  • Ulrich: did you consider using a short queue for the purpose of backfilling?
    • Manfred: no, because until now we saw everybody using the long queue...
    • Alessandro G: multiplying queues in PanDA is not ideal; prefer to pass WC time

CCIN2P3 - V. Hamar

  • Master using a PostgreSQL back end on a dedicated server
  • Master shadowing stopped after several master outages
  • WN : integration with AFS tokens using home-made development
  • Fair share on projects (200 projects, 150 groups): share tree + override tickets for particular use cases
  • Load sensors to sort WNs according to disk and memory usage
  • Separate subclusters for single core jobs and multi-core jobs
  • 18K job slots, 12K pending jobs
  • Grid integration: CCIN2P3-specific integration translating user requirements into GE requirements
  • Several monitoring tools developed locally
  • Multi-core job support: coexistence with single-core jobs not very successful; decided to have 2 subclusters and adjust the number of WNs to the needs (24h reaction time)
    • Will revisit this in the future, in particular using advance reservations
  • Good support from Univa


  • Why did CCIN2P3 choose to have its own MW integration? Because of specific needs, completely different from the community-provided one
  • Why choose Univa GE rather than Son of Grid Engine? KIT and CCIN2P3 needed support
    • Look at status of various GE variants presented during HEPiX Fall 2013: Son of GE seems to be "dying" (at least almost inactive)


Experience @CNAF - S. Del Pra

  • Still running v7 (last version from Platform)
  • 10 CREAM CEs, 1400 WNs, 18K slots, 100 Kjobs/day
    • No resources dedicated to a VO
  • Fairshare, single-core jobs
  • SL6 only: SL5 run through WNoDeS
  • VM management redesigned for scalability using LSF external load indexes: Auger early adopter
  • Small HPC cluster to be added soon
  • Scalability issue seen at CNAF: too many client requests leading to (outbound) network saturation on the master
    • In particular "bjobs -l" every 2 minutes by the CE: improved by bjobsinfo from CERN. Ulrich: there is a caching mechanism available for LSF GIP, used at CERN, with a dramatic impact on the load (information collected once and exported to all CEs)
    • DIRAC also issuing a good number of bqueues/bjobs commands, even though not generating as much output
  • Accounting sensors removed from the CEs: grid records matched with LSF records offline, using a PostgreSQL db on a standalone server
  • Tool to automate kernel upgrades on WN between jobs, on a WN-by-WN basis
  • Use of LSF external load indexes to measure job packing on a node: helps to control packing similar jobs on the same node
    • Use by WNoDeS
    • May result in some unused slots
  • Multi-core job support currently only through a dedicated queue open to Atlas and CMS.
    • Dedicated set of WNs (10): WNs dedicated to multi-core as long as there are multi-core jobs in the queue. Works pretty well as long as there is only one type of multi-core job and a steady flow of them.
    • Plan to experiment with external load indexes to improve flexibility: dynamically growing or shrinking the number of nodes reserved for multi-core jobs
  • Backfilling available but not used, as jobs tend not to define their max time: could be used if it were properly defined
  • Very good overall experience with LSF, scalability not a problem yet...
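The dedicated-queue approach for multi-core jobs could look like this in LSF terms (queue name and script hypothetical, as a sketch only):

```
# request 8 cores on a single host in a (hypothetical) mcore queue
bsub -q mcore -n 8 -R "span[hosts=1]" ./run_payload.sh

# adding an estimated run time in minutes (-We) is the kind of hint
# that would make the backfilling mentioned above usable
bsub -q mcore -n 8 -R "span[hosts=1]" -We 240 ./run_payload.sh
```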

LSF used at several other INFN sites

  • INFN has a national license

HT Condor

RAL experience - A. Lahiff

  • Central managers on a high-availability pair, 3 ARC CEs, 2 CREAM CEs
    • HA as documented by Condor, no shared filesystem
  • CREAM compatibility: not officially supported but works with little effort
    • Job submission: BLAH supporting Condor
    • Wrote a script to publish dynamic info into BDII
    • Wrote a script to convert Condor accounting in PBS style accounting (text files)
    • RAL ready to share its developments: Cristina interested to see them integrated into the official releases
  • ARC CE successfully used by ATLAS and CMS, better support of HTCondor
    • Some patches underway to allow DIRAC usage
    • ALICE still has problems with ARC CE: work in progress
    • ARC accounting publisher (JURA) can send directly to APEL through SSM: no need for APEL node
    • At some point, would like to phase out CREAM CEs...
  • Configured through /etc/condor/config.d: managed by Quattor; Puppet modules also available from the community
  • Condor has no concept of queues: jobs specify resources (cores, memory...) and requirements (OS version...)
  • Hierarchical fairshare available
  • Starting to experiment with cgroups to ensure a job doesn't exceed resources allocated to it
  • Scalability: saw no problem up to 30K simultaneous jobs
  • Good experience with free, community support
    • Experience with a couple of critical issues
  • Multi-core job support: several features to partition resources, including WNs; condor_defrag packs jobs on some nodes while draining others to increase the number of multi-core slots
    • Fully in production since Nov.
  • Dynamic joining of the Condor cluster: no pre-defined list of WNs
    • Makes HTCondor very appropriate for use in an environment with virtualized WNs
    • Can even power up physical machines and shut them down when they are no longer needed
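Since there are no queues, a job simply declares its resource needs in the submit description; a minimal sketch (file names hypothetical), together with the kind of condor_config lines that enable the defragmentation daemon mentioned above:

```
# submit description: resources and requirements instead of a queue
universe       = vanilla
executable     = run_payload.sh
request_cpus   = 8
request_memory = 16000
requirements   = (OpSysAndVer == "SL6")
queue

# condor_config fragment: run the DEFRAG daemon to drain a few nodes
# at a time and grow the number of multi-core slots
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 10
```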


  • Andrew Lahiff: RAL T2 and Bristol using Condor (with ARC CE), other UK sites considering it
  • Chris: 10 UK sites with Torque/MAUI, half thinking of moving to Condor, half to SLURM


CSCS Experience - M. Gila

  • 2K job slots
  • Running SLURM since end of 2011 at CSCS and since Sept. 2013 in the T2
    • 2 month migration
  • Works very well with multi-core jobs
  • Good things
    • Easy to configure and deploy: RPMs easily built, one general configuration file + dedicated config file for other components
    • Accounting in a SQL DB
    • Scalability: tested successfully with 10k jobs
    • Lots of customization points: ability to hook into the job at different states
  • Dark side
    • Some versions very buggy...
    • SLURM accounting, fairshare and QoS complicated to setup
    • Accounting DB can become a bottleneck if not properly tuned
    • Command line syntax is not consistent across tools
    • No API to query SLURM: need to parse the command output (fragile)
  • Support: community support or commercial
    • T2 experience: no need for commercial support
    • Documentation is pretty good
  • CREAM CE: basically ok for job submission
    • Had to write/fix the info provider.
    • Cristina: fixes for SLURM info provider released in Nov. as part of CREAM and for BLAH in March: it seems it was missed by CSCS but everybody agrees to resynchronize
    • A bug found at the beginning but quickly fixed
  • ARC CE ok since the beginning
  • APEL accounting: many problems but should be fixed now in EMI-3, thanks to APEL support
    • Main source of pain for migration
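A sketch of the two pieces discussed above: a multi-core batch request, and the slurm.conf lines that route accounting into the SQL database (script and host names hypothetical):

```
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=12:00:00
srun ./run_payload.sh

# slurm.conf fragment for DB-backed accounting via slurmdbd;
# the DB can become a bottleneck if not properly tuned
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageHost = slurmdb.example.org
```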

NDGF experience - U. Tigerstedt

  • Scientific clusters in Finland are 100% SLURM, Sweden moving fast towards SLURM
    • Developers in all Nordic countries
  • SLURM very good with multi-core jobs if they speak MPI, but requires some tweaking for non-MPI jobs
  • Default scheduler can have problems with a large number of short jobs: more targeted at long-lived jobs
  • Upgrades between major versions (x.y to x.z) can be a hassle
  • SPANK API allows plugging into the job at every stage of its life
  • Issues
    • No IPv6 support
    • All nodes need the same configuration file: it need not be on a shared file system, but the contents (same checksum) must be identical on every node
    • Preference for a shared file system
    • Mixing multi-core and single-core jobs can mess up the default fairshare + backfilling combination


  • Is SLURM fully supported by the MW? Yes (John + Cristina)
    • Bugs must be reported and fixed
  • Stability problems faced at RAL seem to be related to the version used at the time of testing; the current SLURM version (2.6) should have improved things a lot
  • No IPv6 support
  • Mattias: multicore enabled in Umea, receiving a lot of ATLAS multi-core jobs

MW Integration Discussion

Accounting - J. Gordon

Batch systems supported: Torque, PBSPro, LSF, HTCondor, SLURM

Open issues

  • Experiments complain about discrepancies between their own accounting and APEL accounting
    • Is interpretation of batch log correct?
    • Is APEL interpreting the logs of all batch systems consistently?
  • Are we counting failed jobs properly?

Would like to form a small team of experts for each batch system

  • Pepe/PIC volunteering for Torque: already tried many times
  • Miguel (CSCS) volunteering for SLURM


IPv6

SLURM and HTCondor should be OK with dual-stack

  • Torque/MAUI: some problems reported by the IPv6 working group, check its wiki
  • UGE: ???
  • Condor: if the pool has IPv6 WNs, IPv6 is favoured over IPv4

Native IPv6 support still to be done for all batch systems except Condor

  • UGE: not on the roadmap (yet?)
  • SLURM: main developers will not do it but are ready to accept contributions

Multi-core job support

See slides by Antonio Perez

  • Started reviewing batch systems, did HTCondor and UGE but will need more than one iteration
  • Draining is essential to minimize CPU waste but needs reasonably accurate job lifetime estimation
    • Wavelike submission patterns require adjusting the amount of draining
  • Who'll be charged for the CPU waste?

Main question to be discussed with VOs is probably the entropy (mix of different kinds of jobs) needed to allow efficient backfilling

  • If most jobs are both multi-core and long-lived at the same time, this is a problem for shared sites not dedicated to WLCG
    • In fact, most sites
  • Jeff: in fact we see that many ATLAS jobs at NIKHEF last a few hours. The main issue is to get ATLAS to advertise the job duration (in WC time) properly so that it can be used for backfilling
  • Need to encourage VOs to keep a mix of job duration, in particular single-core short jobs
    • ATLAS and LHCb said they want to run both multi-core and single core pilots
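Advertising the expected duration is cheap from the job side; in Torque/MAUI syntax, for example, it is a single resource request that the scheduler can then use for backfilling (script name hypothetical):

```
# single-core job declaring a 4-hour wall-time estimate
qsub -l walltime=04:00:00,nodes=1:ppn=1 job.sh

# 8-core job with a longer estimate
qsub -l walltime=24:00:00,nodes=1:ppn=8 job.sh
```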

Be pragmatic, test solutions: main approach from the TF

  • Further reports first at Ops Coord, then at a GDB in late Spring


  • Michel: you said that mcore/score coexistence needs accurate walltime estimation, which no VO currently does; VOs don't want to do that at all: they want to run for as long as possible and will fill the slot as much as possible.
  • Jeff: for HEP sites, what LHCb wants is a good thing: remove the scheduling. The problem comes from sites with many non-HEP VOs which have many short jobs. If a HEP VO stops using the cluster for some time, it will take forever to ramp up again. Only HEP sites like very long jobs.
  • Antonio: it's a new situation, we should test and try to evolve submission tools based on feedback from the sites.
  • Jeff: to solve it, we need to know exactly what the question is
    • Michel: one reason is the memory footprint problem, improved by multi-core jobs
  • Jeff: at NIKHEF, backfilling was very good for some small VOs who advertise the job max WC time


Build a table summarizing what was presented today for each batch system

  • A twiki page
  • Strengths and weaknesses, staying non controversial: a lot of material in today's presentations
  • Contacts at sites willing to act as "reference site" for a particular batch system

Don't forget volunteering for the accounting assessment team proposed by John

Topic revision: r3 - 2014-03-13 - AndreaSciaba