CMS Tests with the gLite Workload Management System

11 October, 2006

  • Application: CMSSW_0_6_1
  • WMS host:
  • RAM memory: 4 GB
  • LB server: (*)
  • Number of submitted jobs: 25000
  • Number of jobs/collection: 100
  • Number of collections actually submitted: 234
  • Number of CEs: 24
  • Submission start time: 10/10/06, 18:45
  • Submission end time: 10/12/06, 19:10
  • Maximum number of planners/DAG: 2

Memory usage

During the submission, swap usage increased linearly up to 40%, and decreased rapidly shortly after job submission stopped. This means that, at some point, the total memory in use was about 5.8 GB.

The number of planners reached about 250, which accounted for about 1.4 GB. Other processes that used a lot of memory were the WMProxy server (>1.5 GB), the WM (>0.5 GB) and Condor (>0.4 GB).
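These figures can be cross-checked with some back-of-the-envelope arithmetic. Note that the swap partition size (~4.5 GB) used below is an assumption inferred from the 40% usage and the 5.8 GB total, not a value stated above:

```python
# Rough memory accounting for the WMS node during the test.
# ASSUMPTION: swap partition size (~4.5 GB) is inferred, not measured.
ram_gb = 4.0
swap_gb = 4.5                            # assumed so that 4.0 + 0.40 * 4.5 ≈ 5.8 GB
swap_used_gb = 0.40 * swap_gb            # 40% swap usage at the peak
total_used_gb = ram_gb + swap_used_gb    # ≈ 5.8 GB, matching the report

# Per-planner footprint: ~250 planners accounted for ~1.4 GB in total.
planners = 250
planner_total_gb = 1.4
per_planner_mb = planner_total_gb * 1024 / planners   # ≈ 5.7 MB each

print(f"total memory in use: {total_used_gb:.1f} GB")
print(f"per-planner footprint: {per_planner_mb:.1f} MB")
```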

Concerning WMProxy, it is not clear why it used so much memory, and it was suggested to decrease the number of server threads (30 being the current value). In detail, one could:

  • reduce the maximum number of processes running simultaneously (-maxProcesses, -maxClassProcesses)
  • make the killing policy for "idle" processes more aggressive (-KillInterval: the default is 300 seconds)

This is done by changing the "FastCgiConfig" directive at the end of the file /opt/glite/etc/glite_wms_wmproxy_httpd.conf as follows:

FastCgiConfig -restart -restart-delay 5 -idle-timeout 3600 -KillInterval 150 \
  -maxProcesses 25 -maxClassProcesses 10 -minProcesses 5 ..... (keep the rest as it is)
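As a rough illustration of why capping the process count helps: if the >1.5 GB observed for WMProxy is spread over the 30 current server threads, each process costs about 50 MB, so -maxProcesses bounds the total footprint. The ~50 MB per-process figure is an assumption derived from the numbers above:

```python
# Hypothetical effect of the FastCGI process cap on WMProxy memory.
# ASSUMPTION: per-process footprint ~50 MB (>1.5 GB across 30 server threads).
per_process_mb = 1500 / 30          # ~50 MB, inferred from the report

for max_processes in (30, 25, 10):
    cap_gb = max_processes * per_process_mb / 1024
    print(f"maxProcesses={max_processes}: ~{cap_gb:.2f} GB ceiling")
```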

Concerning the WorkloadManager, it is a known issue that the Task Queue consumes a lot of memory; this is no longer seen in gLite 3.1.


During the job submission, attempts to submit jobs from another UI took an unreasonably long time (~3 minutes for a single job), probably due to the high level of swapping.

During the submission, the number of jobs in Submitted status kept increasing, meaning that the WMS could not keep up with the submission rate. After the submission ended, it took about 10 hours to dispatch all the remaining jobs. Again, this is probably due to a general slowness of the machine caused by the swapping. The submission rate was also very close to the maximum dispatch rate; had it been slightly higher, jobs would have kept accumulating even without the swapping effect. It is therefore recommended to submit at a rate significantly lower than the dispatch rate (maybe 70% of it?).
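The rate argument can be made concrete using the figures above (234 collections of 100 jobs actually submitted, a submission window from 10/10 18:45 to 10/12 19:10, and about 10 extra hours to drain the backlog). A sketch of the estimate:

```python
from datetime import datetime

jobs = 234 * 100                      # collections actually submitted × jobs each
start = datetime(2006, 10, 10, 18, 45)
end = datetime(2006, 10, 12, 19, 10)
drain_h = 10.0                        # hours to dispatch the backlog after submission

submit_h = (end - start).total_seconds() / 3600
submit_rate = jobs / submit_h                   # ~483 jobs/h offered
dispatch_rate = jobs / (submit_h + drain_h)     # ~400 jobs/h actually sustained
safe_rate = 0.70 * dispatch_rate                # ~280 jobs/h, per the 70% suggestion

print(f"submission: {submit_rate:.0f} jobs/h, dispatch: {dispatch_rate:.0f} jobs/h")
print(f"suggested submission ceiling: {safe_rate:.0f} jobs/h")
```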

It is also important to deploy as soon as possible the fix that limits the time for which the WM tries to match jobs in the task queue: the current limit of 24 hours is too long, because a collection whose jobs cannot be matched is kept alive for a long time even when it is clear that they will never match.

I noticed that, on a very busy RB, jobs in a collection may be matched even 24 hours after submission:

  - JOBID:
    Event         Time               Reason     Exit  Src  Result  Host
    RegJob        10/10/06 20:52:12                   NS
    RegJob        10/10/06 20:52:15                   NS
    RegJob        10/10/06 20:53:22                   NS
    HelperCall    10/11/06 21:10:33                   BH
    Pending       10/11/06 21:29:02  NO_MATCH         BH
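The delay visible in this log can be computed directly; the timestamps below are taken from the events above:

```python
from datetime import datetime

reg_job = datetime(2006, 10, 10, 20, 52, 12)      # first RegJob event
helper_call = datetime(2006, 10, 11, 21, 10, 33)  # first match attempt (HelperCall)

delay = helper_call - reg_job
print(f"first match attempt after {delay}")       # roughly 24 h 18 m after registration
```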

Note: I discovered that I never really used a separate LB server: the LBAddress attribute must be in the common section of the JDL for a collection, not in the node JDL. Another possibility is to configure the RB itself to use it. In a recently released tag, an LB server can also be specified in the UI configuration.

-- Main.asciaba - 11 Oct 2006

Topic revision: r2 - 2006-10-12 - AndreaSciabaSecondary1