CMS Tests with the gLite Workload Management System

11 October, 2006

  • Application: CMSSW_0_6_1
  • WMS host: rb109.cern.ch
  • RAM memory: 4 GB
  • LB server: rb109.cern.ch (*)
  • Number of submitted jobs: 25000
  • Number of jobs/collection: 100
  • Number of collections actually submitted: 234
  • Number of CEs: 24
  • Submission start time: 10/10/06, 18:45
  • Submission end time: 10/12/06, 19:10
  • Maximum number of planners/DAG: 2

Memory usage

During the submission, the swap memory usage increased linearly up to 40%, and decreased rapidly shortly after the job submission stopped. This means that, at peak, the total memory used (RAM plus swap) was about 5.8 GB.

The number of planners reached about 250, accounting for about 1.4 GB. Other processes that used a lot of memory were the WMProxy server (>1.5 GB), the WM (>0.5 GB) and Condor (>0.4 GB).

Concerning WMProxy, it is not clear why it used so much memory; it was suggested to decrease the number of server threads (currently 30). In detail, one could:

  • reduce the maximum number of processes running simultaneously (-maxProcesses, -maxClassProcesses)
  • make the policy for killing "idle" processes more aggressive (-KillInterval: the default is 300 s)

This is done by changing the "FastCgiConfig" directive at the end of file /opt/glite/etc/glite_wms_wmproxy_httpd.conf as follows:

FastCgiConfig -restart -restart-delay 5 -idle-timeout 3600 -KillInterval 150 \
  -maxProcesses 25 -maxClassProcesses 10 -minProcesses 5 ..... (keep the rest as it is)

(the options from -KillInterval to -minProcesses are the ones added or changed)

Concerning the WorkloadManager, it is a known issue that the Task Queue consumes a lot of memory; this is not seen in gLite 3.1.

Performance

During the job submission, attempts to submit jobs from another UI took an unreasonable amount of time (~3 minutes for a single job), probably due to the heavy swapping.

During the submission, the number of jobs in Submitted status kept increasing, meaning that the WMS could not keep up with the submission rate. After the submission ended, it took about 10 hours to dispatch all the jobs. Again, this is probably due to the general slowness of the machine caused by the swapping. The submission rate was also very close to the maximum dispatch rate: had it been even slightly higher, jobs would have kept accumulating even without the swapping effect. It is therefore recommended to submit at a significantly lower rate (perhaps 70% of the maximum dispatch rate).
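As a rough illustration of the rate argument, using the figures reported for this run (the 10-hour drain time and the 70% margin are taken from the text above; the arithmetic is only indicative):

```python
from datetime import datetime

# Figures from this run: 25000 jobs submitted between the start and end times.
start = datetime(2006, 10, 10, 18, 45)   # submission start
end = datetime(2006, 10, 12, 19, 10)     # submission end
jobs = 25000
drain_hours = 10.0  # time needed to dispatch the backlog after submission ended

submit_hours = (end - start).total_seconds() / 3600.0
submit_rate = jobs / submit_hours                    # jobs/hour offered to the WMS
dispatch_rate = jobs / (submit_hours + drain_hours)  # effective jobs/hour dispatched
safe_rate = 0.7 * dispatch_rate                      # suggested 70% safety margin

print(f"submission rate: {submit_rate:.0f} jobs/h")
print(f"dispatch rate:   {dispatch_rate:.0f} jobs/h")
print(f"suggested rate:  {safe_rate:.0f} jobs/h")
```

The two rates come out within about 20% of each other, which is consistent with the observation that the backlog grew steadily during the whole submission.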

It is also important to have as soon as possible the fix that limits how long the WM keeps trying to match jobs in the task queue: the current limit of 24 hours is too long, because a collection whose jobs cannot be matched is kept alive for a long time, even when it is clear that they can never be matched.

I noticed that, on a very busy RB, jobs in a collection may be matched even 24 hours after submission:

  - JOBID: https://rb109.cern.ch:9000/nrnlNkJeABRP9u6flFPfyA
    Event        Time               Reason     Exit  Src  Result  Host
    RegJob       10/10/06 20:52:12                   NS           rb109.cern.ch
    RegJob       10/10/06 20:52:15                   NS           rb109.cern.ch
    RegJob       10/10/06 20:53:22                   NS           rb109.cern.ch
    HelperCall   10/11/06 21:10:33                   BH           rb109.cern.ch
    Pending      10/11/06 21:29:02  NO_MATCH         BH           rb109.cern.ch

Note: I discovered that I never really used a separate LB server: for a collection, the LBAddress attribute must be in the common section of the JDL, not in the node JDL. Another possibility is to specify the LB server in the RB configuration. In a recently released tag, an LB server can also be specified in the UI configuration.
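For reference, a minimal sketch of where the attribute has to go in a collection JDL (the node contents are placeholders, and the port is assumed to be the usual LB port 9000 seen in the job identifiers above):

```
[
  Type = "collection";
  LBAddress = "lxb7026.cern.ch:9000";  // common section: applies to the whole collection
  nodes = {
    [
      Executable = "cmsRun.sh";        // placeholder node JDL
      ...
    ]
  };
]
```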

13 October, 2006

  • Application: CMSSW_0_6_1
  • WMS host: rb109.cern.ch
  • RAM memory: 4 GB
  • LB server: lxb7026.cern.ch
  • Number of submitted jobs: 14000
  • Number of jobs/collection: 100
  • Number of collections actually submitted: 140
  • Number of CEs: 28
  • Submission start time: 10/13/06, 11:10
  • Submission end time: 10/14/06, 9:43
  • Maximum number of planners/DAG: 2

30 October, 2006

  • Application: CMSSW_0_6_1
  • WMS host: lxb7283.cern.ch
  • Flavour: gLite 3.1
  • RAM memory: 4 GB
  • LB server: lxb7283.cern.ch
  • Number of submitted jobs: 2400
  • Number of jobs/collection: 100
  • Number of collections actually submitted: 24
  • Number of CEs: 24
  • Submission start time: 10/30/06, 12:30
  • Submission end time: 10/30/06, 12:59
  • Maximum number of planners/DAG: 10

Summary table

Site                               Submit  Wait Ready Sched   Run Done(S) Done(F) Abo Clear Canc
cclcgceli02.in2p3.fr                   23     0     0     1     0    76     0     0     0     0
ce01-lcg.cr.cnaf.infn.it                0     0     0     0     0   100     0     0     0     0
ce01-lcg.projects.cscs.ch               0     0     0     0     0   100     0     0     0     0
ce03-lcg.cr.cnaf.infn.it                0     0     0     0     0   100     0     0     0     0
ce04.pic.es                             0     0     0     0     0   100     0     0     0     0
ce106.cern.ch                           0     0     0     0     0   100     0     0     0     0
ceitep.itep.ru                          0     0     0     0     0   100     0     0     0     0
cmslcgce.fnal.gov                       0     0     0     0     0   100     0     0     0     0
cmsrm-ce01.roma1.infn.it                0     0     0     0     0   100     0     0     0     0
dgc-grid-40.brunel.ac.uk                0     0     0     0     0   100     0     0     0     0
egeece.ifca.org.es                     80    20     0     0     0     0     0     0     0     0
grid-ce0.desy.de                        0     0     0     0     0   100     0     0     0     0
grid-ce1.desy.de                        0     0     0     0     0   100     0     0     0     0
grid-ce2.desy.de                        0     0     0     0     0   100     0     0     0     0
grid10.lal.in2p3.fr                     0     0     0     0     0   100     0     0     0     0
grid109.kfki.hu                         0     0     0     0     0   100     0     0     0     0
gridba2.ba.infn.it                      0     0     0     0     0   100     0     0     0     0
gridce.iihe.ac.be                       0     0     0     0     0    97     3     0     0     0
gw39.hep.ph.ic.ac.uk                    0     0     0    49     0     3     0    48     0     0
lcg00125.grid.sinica.edu.tw             0     0     0     9     9    73     0     9     0     0
lcg06.sinp.msu.ru                       0     0     0   100     0     0     0     0     0     0
oberon.hep.kbfi.ee                      0     0     0   100     0     0     0     0     0     0
polgrid1.in2p3.fr                       0     0     0     0     0   100     0     0     0     0
t2-ce-02.lnl.infn.it                    0     0     0     0     0   100     0     0     0     0

Comments

The jobs at cclcgceli02.in2p3.fr shown as Submitted have in fact finished, but in the logging info the last event is a RegJob, whose timestamp, however, is close to those of the other RegJob events. In addition, for those jobs glite-job-status -v 3 reports timestamps only for Submitted and Waiting. This is linked to the fact that the sequence code of the logged events is wrong: the last RegJob event had the sequence code

UI=000000:NS=0000000001:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000

while the first event from WM/BH had

UI=000000:NS=0000000000:WM=000000:BH=0000000001:JSS=000000:LM=000000:LRMS=000000:APP=000000

instead of

UI=000000:NS=0000000001:WM=000000:BH=0000000001:JSS=000000:LM=000000:LRMS=000000:APP=000000

which causes all subsequent events to be considered prior to the last RegJob. The reason for this behaviour is not yet understood.
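The ordering problem can be reproduced with a small sketch: if each counter in the sequence code is compared numerically, component by component, an event whose NS counter fell back to 0 sorts before the last RegJob regardless of its BH counter (the parsing below is a simplified illustration, not the actual LB code):

```python
def parse_seqcode(code):
    # Split "UI=000000:NS=0000000001:..." into a dict of integer counters.
    return {k: int(v) for k, v in (part.split("=") for part in code.split(":"))}

# Component order as it appears in the sequence code string.
ORDER = ["UI", "NS", "WM", "BH", "JSS", "LM", "LRMS", "APP"]

def sort_key(code):
    # Compare counters component by component, in that order.
    counters = parse_seqcode(code)
    return tuple(counters[k] for k in ORDER)

last_regjob = ("UI=000000:NS=0000000001:WM=000000:BH=0000000000:"
               "JSS=000000:LM=000000:LRMS=000000:APP=000000")
first_bh    = ("UI=000000:NS=0000000000:WM=000000:BH=0000000001:"
               "JSS=000000:LM=000000:LRMS=000000:APP=000000")

# Because the BH event carries NS=0 instead of NS=1, it compares as earlier
# than the last RegJob, and so do all events that follow it.
print(sort_key(first_bh) < sort_key(last_regjob))   # True
```

With the corrected NS counter (NS=0000000001) the BH event would compare as later than the RegJob, as expected.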

The aborted jobs at gw39.hep.ph.ic.ac.uk and lcg00125.grid.sinica.edu.tw had the "unspecified gridmanager error".

The 3 failed jobs at gridce.iihe.ac.be have the "Got a job held event, reason: Globus error 124: old job manager is still alive" error.

The jobs at egeece.ifca.org.es are either Submitted or Waiting because no CE can be matched. There are 20 Waiting jobs, which is strange given that the maximum number of planners per DAG is 10.

-- AndreaSciaba - 30 Oct 2006

Topic revision: r4 - 2006-10-30 - AndreaSciaba