Job running time, memory and local disk limits to protect workernodes

Introduction

  • Purpose: define limits on job running time, memory and local disk space consumption for GRID jobs/applications to protect workernodes
  • Limits can be imposed on 4 different levels, listed here from farthest from to closest to the actual processes on the workernode:
    1. site protections on the batch system level
    2. pilot protections in the pilot
    3. protections in the WMAgent/CRAB job wrapper
    4. protections directly in CMSSW
  • Limits should be stricter the closer the level is to the actual process on the workernode, so that the innermost protection triggers first and the outer levels act only as a fallback
  • Advice
    • Job termination should first be attempted gracefully when possible (finish processing the current event and return output for debugging); if that does not succeed, a hard kill is acceptable
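
The advice above (graceful termination first, hard kill second) can be sketched as follows. This is a minimal illustration, not the mechanism used by any of the four levels: the function name `terminate_gracefully`, the choice of SIGTERM as the "graceful" request, and the grace period are all assumptions, and it presumes a POSIX system where the payload reacts to SIGTERM by finishing its current unit of work.

```python
import subprocess


def terminate_gracefully(proc, grace_seconds=30.0):
    """Ask a child process to stop, then force-kill it if it does not exit in time.

    Hypothetical helper: returns "graceful" if the process exited after the
    polite request, or "killed" if a hard kill was needed.
    """
    proc.terminate()  # SIGTERM on POSIX: request a clean shutdown
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"
```

The grace period would need to be long enough for the payload to finish its current event and write its output, which is one reason the inner limits sit well below the outer ones.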

Proposed limits

| Property | Batch system limit | glideIn pilot limit | WMAgent/CRAB job wrapper limit | CMSSW limit |
| memory | 2.5 GB RSS | 2.4 GB RSS | 2.3 GB RSS | 2.2 GB RSS |
| job running time | 48:00 hours | 47:50 hours | 47:40 hours | 40:00 hours |
| local disk space | 20 GB | 19.5 GB | 19 GB | n/a |
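
As a cross-check of the table above, the proposed values can be written down as data and tested for the intended ordering (each level stricter than the one outside it). This is only a sketch: the names `LEVELS`, `LIMITS`, `is_strictly_decreasing` and `headroom` are made up for illustration, and running times are converted to minutes for easier arithmetic.

```python
# Proposed limits, ordered from the outermost level (batch system) to the
# innermost level (CMSSW). None marks a limit that has no value at that level.
LEVELS = ["batch system", "glideIn pilot", "WMAgent/CRAB job wrapper", "CMSSW"]

LIMITS = {
    "memory_gb_rss": [2.5, 2.4, 2.3, 2.2],
    "running_time_min": [48 * 60, 47 * 60 + 50, 47 * 60 + 40, 40 * 60],
    "local_disk_gb": [20.0, 19.5, 19.0, None],
}


def is_strictly_decreasing(values):
    """True if every inner level has a stricter (smaller) limit than the outer one."""
    present = [v for v in values if v is not None]
    return all(outer > inner for outer, inner in zip(present, present[1:]))


def headroom(values):
    """Margin between each level and the next inner level."""
    present = [v for v in values if v is not None]
    return [outer - inner for outer, inner in zip(present, present[1:])]


if __name__ == "__main__":
    for name, values in LIMITS.items():
        print(name, "ordered correctly:", is_strictly_decreasing(values))
    # Margins in minutes between consecutive running-time limits.
    print("running time headroom (min):", headroom(LIMITS["running_time_min"]))
```

For the running time, the margins come out as 10, 10 and 460 minutes, so the job wrapper has 7 hours 40 minutes after the CMSSW limit before its own limit is reached.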

  • Comments:
    • The 40 hour limit for CMSSW should leave enough time for stageout and close-out of the job
      • Chained workflows (multiple steps run on the same workernode) still need to be thought through
    • The local disk space limit is not applicable at the CMSSW level
    • Memory limits are specified in RSS; on SL6 and later, PSS should be used instead
    • Batch system limits:
      • If sites have more resources per core, the limits should be adapted to support the additional resources
        • Ops should be informed, and special queues may need to be set up
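
To illustrate the RSS versus PSS comment above: RSS counts every resident page of a process in full, while PSS divides pages shared with other processes among the processes sharing them. On Linux kernels that report it, both values appear per mapping in `/proc/<pid>/smaps`. The sketch below sums one field over all mappings; the function name `sum_smaps_field_kb` is hypothetical, and nothing here is taken from how the pilot or job wrapper actually measure memory.

```python
def sum_smaps_field_kb(smaps_text, field):
    """Sum one per-mapping field (for example "Rss" or "Pss") over smaps content.

    smaps_text is the text of /proc/<pid>/smaps; values are reported in kB.
    """
    prefix = field + ":"  # the colon keeps "Pss" from matching e.g. "Pss_Anon"
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(prefix):
            total += int(line.split()[1])
    return total


def read_memory_kb(pid="self"):
    """Return (rss_kb, pss_kb) for a process; requires a Linux /proc filesystem."""
    with open("/proc/{}/smaps".format(pid)) as handle:
        text = handle.read()
    return sum_smaps_field_kb(text, "Rss"), sum_smaps_field_kb(text, "Pss")
```

A limit check based on PSS would compare the second value against the configured limit instead of the first, which avoids charging a job in full for pages it shares with other processes.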

Analysis April 2012: local disk usage from glideIn schedd used for analysis and production

DiskUsage_plot.png

Analysis April 2012: information from DashBoard for analysis and production jobs

avgCPUtime_analysis.png avgCPUtime_production.png
avgWalltime_analysis.png avgWalltime_production.png
maxCPUtime_analysis.png maxCPUtime_production.png
maxWalltime_analysis.png maxWalltime_production.png

Analysis April 2012: information from CouchDB only for production jobs

PeakValueRss.png PeakValueVsize.png
TotalJobCPU.png TotalJobTime.png

Topic revision: r3 - 2012-06-13 - OliverGutsche
 