WLCG Operations Coordination twiki page for internal work on action 2015-12-17 (memory limit configuration for batch systems).

What is this twiki?

This is a WLCG Operations Coordination twiki page for internal work on action 2015-12-17 (memory limit configuration for batch systems). Maria A., Maria D., Maarten and Alessandra F. will analyse the intermediate results. The recommendation requested by the MB will be distilled from the answers we receive from the T0 and the T1s.

Why do we need it? Background

The action was recorded just before Christmas 2015; see WLCGOpsMinutes151217#Action_list. However, the WLCG MB had already asked in October for a recommendation on how to configure memory limits for batch jobs; see the minutes, section 5, and the MB action list.

Maarten suggested that WLCG Ops look into this. His summary: the matter came out of the discussion on how much RAM per core (logical CPU) should be recommended as a minimum for new purchases. RAM is important, but there should also be sensible limits on the VMEM/swap per job, which should be a lot higher than the 2 GB RSS that a single-core job may expect to get. It was then said that jobs are often killed "unnecessarily" because of VMEM, and that many sites have their own ways of deciding what to kill and when. Hence it would be good to have a twiki page with a matrix of recipes and/or scripts per batch system flavour, version and set of supported experiments.

How to go about getting the info?

Maria D., after email and f2f discussions with Maria A., Maarten and Alessandra F., opened GGUS tickets on Jan 12th to the T0 and the T1s containing the following questions:

  • which batch system they run
  • how they configure memory limits
  • how they kill jobs that exceed these limits
  • if they do it by script, whether they are willing to share it
  • whether they use cgroups
  • whether they use containers
  • which VOs they support.

Intermediate results (the sites' raw answers)

| Site = GOCDB name | Batch system | Memory limits config | Handling of jobs exceeding limits | cgroups | containers | VOs | Ticket |
| CERN = CERN-PROD | LSF 9.1.3.0 | Per queue, to stop unreasonably high resource requests, but the limits are not actually enforced; oom_kill is simply left to kill the processes using the most memory when a node runs short | oom_kill (willing to share the exclusion list, without particularly recommending it) | No | No | All 4 LHC VOs + another 34 VOs | GGUS:118800 |
| CERN = CERN-PROD | HTCondor | HTCondor configuration, which generates cgroups (see the HTCondor sketch after this table) | HTCondor configuration | Yes | Not at the moment | All 4 LHC VOs + others | GGUS:118800 |
| ASGC = TAIWAN_LCG2 | HTCondor | Via an ARC CE, which translates the memory limit defined in AGIS into the HTCondor configuration | HTCondor configuration | No | No | ATLAS | GGUS:118812 |
| BNL = BNL-ATLAS | HTCondor | Via cgroups (not a hard limit, so the OOM killer still acts; the cgroup just contains the damage). AGIS lets the site admin set the maximum memory allowed for the job payload of each ATLAS PanDA queue; based on the type of jobs expected, the limit is set differently per queue, from 12 GB to 64 GB across the BNL PanDA queues. The pilot does not enforce this limit at run time (at least at this point), but PanDA brokerage uses it in its dispatch decisions | HTCondor policy expressions for some classes of jobs, OOM/cgroup for the rest, plus an older script (probably willing to share it, but it is the oldest mechanism and is rarely triggered anymore, the limits being set very high) | Yes | No | ATLAS + other OSG VOs | GGUS:118799 |
| CNAF = INFN-T1 | LSF 9.1.3.0 build 238688 | Per queue, to stop unreasonably high resource requests | LSF does it (memory, run time, CPU time) | Not yet | No | All 4 LHC VOs + others | GGUS:118803 |
| FNAL = USCMS-FNAL-WC1 | HTCondor | Via HTCondor mechanisms; now exploring cgroups | HTCondor monitors job resource usage and kills jobs that exceed the limits | Not yet | No | CMS + other OSG VOs | GGUS:118814 |
| IN2P3 = IN2P3-CC | Univa Grid Engine (UGE) | Memory limits are defined per queue. The h_vmem limit covers the combined virtual memory of all processes in the job; at submission time a lower soft limit (s_vmem) can be set (see the UGE sketch after this table) | Jobs that exceed h_vmem are killed by UGE with SIGKILL; if s_vmem is exceeded, UGE sends SIGXCPU, which the job can catch | Currently defined for CPU only; memory will follow with the next release | Not at the moment | All 4 LHC VOs + others | GGUS:118802 |
| JINR = JINR-T1 | Torque + Maui | "ulimit -v" is called inside the /etc/init.d/pbs_mom service script, so the limit applies to all processes running in batch (see the Torque sketch after this table) | Done by the system | No | No | CMS, OPS, DTEAM | GGUS:118804 |
| KISTI = KR-KISTI-GSDC-01 | Torque + Maui | A 3 GB per job principle is followed but not enforced | Jobs are not killed, since memory limits are not enforced; it is true that a worker node sometimes crashes because of memory exhaustion | No | No | ALICE | GGUS:118805 |
| KIT = FZK-LCG2 | UGE 8.3.0, update to UGE 8.3.1p6 in Feb 2016 | The sge_local_submit_attributes.sh helper converts the memory requests of the CREAM JDL (i.e. $GlueHostMainMemoryRAMSize) to the corresponding qsub flag ('#$ -l m_mem_free=${memmin}M'); the default limit is 2000M. The virtual memory limit is three times the RSS limit ('#$ -l h_vmem=${vmemmin}M'). RSS limits are soft limits: user jobs can exceed them if free memory is left | Limits are managed by cgroups, not by UGE. Be aware that by default UGE sends SIGXCPU when a job exceeds its RSS limit, which is not what we want; to have the limits monitored by cgroups instead of UGE, set 'ENFORCE_LIMITS=SHELL' in the 'execd_params' attribute of the global UGE configuration ('qconf -mconf') | Yes, cgroups are used for memory management, with the UGE configuration 'cgroups_params group_path=/cgroup cpuset=true mount=true freezer=true freeze_pe_tasks=true killing=true forced_numa=true h_vmem_limit=true m_mem_free_hard=false m_mem_free_soft=true min_memory_limit=500M' | No | All 4 LHC VOs + others | GGUS:118801 |
| NDGF = NDGF-T1 | All sites use Slurm, except CSC which uses SGE | The limit is mostly set by the job description. If it is absent, sites UCPH and UiB apply a 2 GB default, while site UiO rejects limitless jobs. Site CSC uses the SGE configuration variable h_vmem, currently set to 5 GB | Most sites do not kill jobs, except those using SGE and/or cgroups | UCPH, UiO, NSC, HPC2N: yes; CSC, UiB: no | No, some sites are looking into it | ATLAS, ALICE, OPS, OPS.NDGF.ORG, CMS | GGUS:118806 |
| NIKHEF = NIKHEF-ELPROD | Torque (2.3.8 with some local patches) | By setting ulimit in Torque | The ulimit is enforced by the OS | No | No | ATLAS, LHCb, ALICE + others | GGUS:118807 |
| SARA = SARA_MATRIX | Torque 4 + Maui | Currently none | - | - | No | ALICE, ATLAS, LHCb + others | GGUS:118811 |
| NRC-KI = RRC-KI-T1 | Torque + Maui | Max pvmem = 4 GB and max vmem = 6 GB for ATLAS; max pvmem = 6 GB and max vmem = 8 GB for ALICE and LHCb | Done by Torque | No | No | ATLAS, ALICE, LHCb, OPS, DTEAM | GGUS:118810 |
| PIC = pic | Torque + Maui | Max pvmem of 6 GB per queue (single core); memory limits disabled for the multicore queues (see the Torque sketch after this table) | Torque kills the jobs that exceed the memory limits | No | Testing Docker | ATLAS, CMS, LHCb + others | GGUS:118808 |
| RAL = RAL-LCG2 | HTCondor | Jobs exceeding 3x the requested memory are killed | SYSTEM_PERIODIC_REMOVE (text in the twiki by Andrew Lahiff; see the HTCondor sketch after this table) | Yes | Testing the Docker universe | All 4 LHC VOs + others | GGUS:118809 |
| TRIUMF = TRIUMF-LCG2 | Torque (2.5.12 with some security fixes) + Maui; also some HTCondor test WNs | The site does not configure memory limits; however, ATLAS pilots set a virtual memory limit according to the definition in AGIS | Job memory usage is controlled by the virtual memory limit set by the ATLAS pilot | Not with Torque, but planned with HTCondor | No | ATLAS | GGUS:118813 |
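
Several HTCondor answers above (CERN, RAL) rely on a small number of standard HTCondor knobs. The fragment below is a minimal sketch, not the actual configuration of any site: it assumes HTCondor 8.x with cgroup support on the worker nodes, and the expression is our illustration of the "3x requested memory" rule RAL describes (ResidentSetSize is reported in KiB, RequestMemory in MiB).

    # /etc/condor/config.d/20-memory.conf -- illustrative values only

    # Put each job's processes in their own cgroup under this base cgroup.
    BASE_CGROUP = htcondor

    # "soft": a job may exceed its RequestMemory while free RAM remains;
    # under memory pressure the kernel OOM killer targets the worst cgroup.
    CGROUP_MEMORY_LIMIT_POLICY = soft

    # Remove jobs whose resident set grows beyond 3x the memory they
    # requested (ResidentSetSize in KiB, RequestMemory in MiB).
    SYSTEM_PERIODIC_REMOVE = ResidentSetSize > 3 * 1024 * RequestMemory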
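For the UGE/SGE sites, the limits hinge on the s_vmem/h_vmem complexes described by IN2P3. A hypothetical submission showing both limits (the values are examples, not IN2P3's actual settings):

    # Soft limit first: past s_vmem the job receives a catchable SIGXCPU;
    # past h_vmem, UGE kills it with SIGKILL.
    qsub -l s_vmem=3500M -l h_vmem=4G payload.sh

    # Administrators can instead set queue-wide defaults, e.g.
    #   qconf -mq <queue>    # then edit the s_vmem / h_vmem fields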
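The Torque answers split into two approaches: a node-wide ulimit set where pbs_mom starts (JINR, NIKHEF), and per-queue pvmem/vmem maxima on the server (PIC, NRC-KI). A sketch of both follows; the queue name and all values are hypothetical.

    # (a) Node-wide cap, along the lines of JINR's /etc/init.d/pbs_mom edit:
    # every process spawned under pbs_mom inherits the limit (KiB; 4 GiB here).
    ulimit -v 4194304

    # (b) Per-queue caps, set once on the Torque server via qmgr:
    qmgr -c 'set queue gridq resources_max.pvmem = 4gb'  # per-process VMEM
    qmgr -c 'set queue gridq resources_max.vmem = 6gb'   # whole-job VMEM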

Recommendation

This recommendation results from the input received from the T0 and all the T1s. The sites use a variety of batch systems: LSF, HTCondor, Torque+Maui, Slurm, SGE, UGE. Ideally, each job submission should itself request the amount of memory the job needs. In any case, configuring a default maximum of at least 2 GB of RSS per single-core slot is recommended (one way to express such a default is sketched below). It is most practical for sites to use a batch system that itself deals with jobs exceeding their limits and/or to make use of cgroups, thereby avoiding extra scripts or manual intervention.
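
For Slurm sites (the majority at NDGF), the "2 GB per core by default, enforced by cgroups" idea could look like the following. This is a minimal sketch under assumed settings, not any site's actual configuration.

    # slurm.conf fragment -- illustrative values only
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory   # treat memory as a schedulable resource
    DefMemPerCPU=2048                    # MiB per core when a job requests none
    TaskPlugin=task/cgroup               # delegate enforcement to cgroups

    # cgroup.conf
    ConstrainRAMSpace=yes                # cap each job's RSS at its allocation
    ConstrainSwapSpace=yes               # limit swap usage as well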

Other twikis with processed information on batch systems (by Alessandra)

-- MariaDimou - 2016-01-07

Topic attachments

Screenshot_from_2016-01-12_150535.png (PNG, 388.5 K, 2016-01-12, MariaDimou): screenshot of the GGUS tickets submitted to the T0 and the T1s on 12-Jan-2016.