SL4 LCG CE test notes

This page describes all the tests that have been performed on the SL4 LCG CE test setup.

Setup

  • CE: 2GB memory, dual 2.8GHz Xeon, lxb6139
  • Torque server: 512MB memory, dual 1.0GHz PIII
  • WNs: 18 WNs with 256 or 512 MB memory, dual 1.0GHz PIII, 10 job slots per node
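The worker-node figures above fix the cluster's batch capacity, which puts the test sizes in context. A quick back-of-the-envelope sketch in Python (it assumes all slots stay busy and ignores scheduling and submission overhead):

```python
import math

# Batch capacity implied by the setup list above.
wns = 18            # worker nodes
slots_per_wn = 10   # job slots per node
total_slots = wns * slots_per_wn          # 180 concurrent jobs

# Ideal drain time for 3000 jobs of 2000 seconds each (the 2007-8-14 test),
# assuming every slot stays busy and overhead is negligible.
jobs, job_len = 3000, 2000
waves = math.ceil(jobs / total_slots)     # full "waves" of jobs through the slots
drain_seconds = waves * job_len
print(total_slots, drain_seconds)         # 180 slots, 34000 s (~9.4 hours)
```

So even the smallest test run could not finish in under roughly nine hours of wall-clock time, which is why several thousand jobs sit in the queue for most of each test.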

Tests

  • 2007-8-14: 3000 2000-second sleep jobs submitted in 3 collections by a single user; 3000 (100%) terminated successfully.

  • 2007-8-15: 8000 2000-second sleep jobs submitted in 8 collections by a single user; 7996 (99.95%) terminated successfully. Most of the time there were more than 3000 jobs in the queue, with a maximum of 4700. While jobs were being pushed to the CE by the WMS, the CPU load could rise above 15; the maximum CPU load during the test was 86.

  • 2007-8-16: 8000 2000-second sleep jobs submitted by 40 users; the test failed completely because of very high CPU load. Shortly after the test started, the CE became unresponsive: the CPU load reached 725, and more than 3223 processes were running, most of them globus-job-manager, globus-gass-cache, grid-monitor, glite-lb-logd, globus-gatekeeper, etc.

  • 2007-8-20: 8000 2000-second sleep jobs submitted by 10 users; 7874 (98.425%) terminated successfully. 83 jobs failed with "Unspecified gridmanager error", 42 failed with "Globus error 94: the jobmanager does not accept any new requests (shutting down)", and one failed with "Cannot read JobWrapper output, both from Condor and from Maradona". When the test started, the CPU load on the CE quickly rose above 100 and stayed there for about 4 hours, peaking at 201. During that period job submission from the WMS to the CE was also slow; once the load dropped, the WMS could push jobs to the CE much faster, and the CPU load stayed between 20 and 100 until the test finished. Most of the time there were more than 3000 jobs in the queue, with a maximum of 4565.

  • 2007-8-21: 9000 2000-second sleep jobs submitted by 15 users; 1872 (20.8%) terminated successfully. 505 failed with "Unspecified gridmanager error", 187 failed with "Globus error 94: the jobmanager does not accept any new requests (shutting down)", and 3 failed with "Cannot read JobWrapper output, both from Condor and from Maradona". All other jobs remained in intermediate states until the proxy expired (about 4 days lifetime); those jobs had actually finished long before, but the WMS was slow to retrieve their status, and the CPU load during that period was around 22. When job submission started, the CPU load on the CE quickly rose above 200 and stayed between 160 and 320 for close to 5 hours. A balance then seemed to be reached between the job distribution speed from the WMS and the CPU load on the CE: the load settled around 55 while the number of queued jobs steadily increased to a maximum of 4869. As the jobs drained, the CPU load dropped slightly and oscillated between 25 and 50, although it could occasionally spike as high as 130 because extra processes were launched when jobs terminated.

  • 2008-01-10: 6000 1800-second sleep jobs submitted by 30 users; 5107 terminated successfully, and 893 failed with "10 data transfer to the server failed".

  • 2008-01-13: 6000 1800-second sleep jobs submitted by 20 users; all jobs terminated successfully.
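The success percentages quoted in the entries above follow directly from the raw counts. A small script to recompute them (the 2007-8-16 run is omitted because it failed entirely):

```python
# Success rates recomputed from the (submitted, terminated-OK) counts
# reported in the test log above.
tests = {
    "2007-8-14": (3000, 3000),
    "2007-8-15": (8000, 7996),
    "2007-8-20": (8000, 7874),
    "2007-8-21": (9000, 1872),
    "2008-01-10": (6000, 5107),
    "2008-01-13": (6000, 6000),
}

for date, (submitted, ok) in tests.items():
    rate = 100.0 * ok / submitted
    print(f"{date}: {ok}/{submitted} = {rate:.3f}%")
```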

-- DiQing - 15 Aug 2007

Topic revision: r7 - 2008-02-12 - DiQing