gLite CE Test Work Log

Changes:

* 2007-02-07: Two machines (CE and WMS) on certification testbed upgraded to condor 6.8.4

  • rpm -Uvh --nodeps condor-6.8.4-linux-x86-rhel3-dynamic-1.i386.rpm (--nodeps is needed because glite-CE declares "Depends: condor (= 6.7.10-1)")
  • change the condor version to 6.8.4 in /opt/glite/etc/config/glite-ce.cfg.xml and /opt/glite/etc/config/glite-wms.cfg.xml
  • change line 215 by hand (set HOSTALLOW_WRITE to something appropriate, and probably HOSTALLOW_READ as well; this still needs thought) in /opt/condor-c/etc/condor_config
  • Add GRIDMANAGER_TIMEOUT_MULTIPLIER = 3, SCHEDD_TIMEOUT_MULTIPLIER = 3, COLLECTOR_TIMEOUT_MULTIPLIER = 3, C_GAHP_TIMEOUT_MULTIPLIER = 3, C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER = 3, TOOL_TIMEOUT_MULTIPLIER = 3, GLITE_CONDORC_DEBUG_LEVEL = 2, GLITE_CONDORC_LOG_DIR = /var/tmp, NEGOTIATOR_MATCHLIST_CACHING = False in /opt/condor-c/local./condor_config.local on the WMS (see the sketch after this list)
  • Add GLITE_LOCATION=/opt/glite/ in /opt/condor-c/local./condor_config.local on glite CE
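
  For reference, a minimal sketch of the corresponding fragments of /opt/condor-c/local./condor_config.local (parameter names and values copied from the items above; the comments are mine):

      # On the WMS: stretch the Condor-C timeouts and raise the Condor-C debug level
      GRIDMANAGER_TIMEOUT_MULTIPLIER = 3
      SCHEDD_TIMEOUT_MULTIPLIER = 3
      COLLECTOR_TIMEOUT_MULTIPLIER = 3
      C_GAHP_TIMEOUT_MULTIPLIER = 3
      C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER = 3
      TOOL_TIMEOUT_MULTIPLIER = 3
      GLITE_CONDORC_DEBUG_LEVEL = 2
      GLITE_CONDORC_LOG_DIR = /var/tmp
      NEGOTIATOR_MATCHLIST_CACHING = False

      # On the gLite CE: point Condor-C at the gLite installation
      GLITE_LOCATION = /opt/glite/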

* 2007-02-27: Upgraded Condor to 6.8.4 on the virtual testbed following the steps above

* 2007-03-02: Upgraded lxb0743 to patch 1031, which is a backport of the 3.1 blah that fixes bug 20910

* 2007-03-20: set C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER = 10 in /opt/condor-c/local./condor_config.local on WMS

* 2007-03-21: ulimit -n 4096 on lxb2034 and lxb0743, and echo 16383 > /proc/sys/kernel/threads-max; on the CE, changed /opt/condor-c/condor_config to increase MAX_SCHEDD_LOG to 100000000 and QUEUE_CLEAN_INTERVAL to 345600 so that log info is not rolled over too quickly (see the sketch below)
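
  A hedged shell sketch of those changes (limits and values as stated above; whether ulimit belongs in the Condor start-up script or an interactive shell is left open here):

      # On lxb2034 and lxb0743: raise the per-process open-file limit
      ulimit -n 4096
      # Raise the kernel-wide thread limit
      echo 16383 > /proc/sys/kernel/threads-max

      # On the CE, in /opt/condor-c/condor_config: keep more schedd log and clean the job queue less often
      MAX_SCHEDD_LOG = 100000000
      QUEUE_CLEAN_INTERVAL = 345600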

* 2007-03-21: removed Condor.glidein under the pool account home directories, since a lot of jobs were left there; this was to check whether the leftover jobs slowed Condor down. In my test, before removing it the step from CompletionDate to StageOutStart took 226 seconds; after removing it, it took only 13 seconds

* 2007-04-12: SCHEDD_TIMEOUT_MULTIPLIER = 5 on CE and C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER = 20 on WMS

* 2007-05-14: On the gLiteCE node (lxb0743), GRAM was saving its state files under /var/glite/gram_job_state/, as specified in the jobmanager config file /opt/glite/etc/globus-job-manager.conf. However, the gridmonitor (in any version, including the LCG one) looks for the state file directory starting from $GLOBUS_LOCATION/etc/grid-services, which leads to the Globus default config file /opt/globus/etc/globus-job-manager.conf. That file sets "-state-file-dir /opt/globus/tmp/gram_job_state", but /opt/globus/tmp/gram_job_state was empty, so the state of fork jobs was never correctly updated. As a quick fix, I symlinked /opt/globus/etc/grid-services/jobmanager to /opt/glite/etc/grid-services/jobmanager-fork. Another good solution is to symlink /opt/globus/tmp/gram_job_state to /var/glite/gram_job_state/ (shell sketch below).
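
  A shell sketch of the two workarounds described above (link directions follow that description; back up or remove any existing entries first):

      # Quick fix: point the grid-services entry used by the gridmonitor at the gLite jobmanager
      ln -s /opt/glite/etc/grid-services/jobmanager-fork /opt/globus/etc/grid-services/jobmanager

      # Alternative: make the Globus default state directory a link to the real gLite one
      # (the empty /opt/globus/tmp/gram_job_state has to be removed first)
      ln -s /var/glite/gram_job_state /opt/globus/tmp/gram_job_state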

* 2007-05-22: Condor has been upgraded to 6.8.5 on both lxb0743 (gLite CE) and lxb7283 (WMS). On the WMS, the value of C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER has been decreased to 10; on the CE, SCHEDD_TIMEOUT_MULTIPLIER = 5 has been removed, since 6.8.5 should have fixed the c-gahp -> schedd error bug.

* 2007-05-26: on the gLite CE, INFN_JOB_POLL_INTERVAL was increased from 120 to 360 (with Condor 6.8.5 the CPU load on the gLite CE was too high: often above 20, sometimes even above 40). On the WMS, C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER was increased to 20 again.

* 2007-06-10: on the gLite CE, in /opt/condor-c/local./condor_config.local, reduced GRIDMANAGER_MAX_PENDING_REQUESTS to 5; INFN_JOB_POLL_INTERVAL can be increased from its default value as well. Both values depend on the machine performance and the number of simultaneous jobs (see the sketch below).
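
  A minimal sketch of that tuning in /opt/condor-c/local./condor_config.local on the gLite CE (5 is the value above; 360 is the poll interval used on 2007-05-26, shown here only as an example - both should be tuned to the machine and the job load):

      # Fewer outstanding gridmanager requests on a loaded CE
      GRIDMANAGER_MAX_PENDING_REQUESTS = 5
      # Poll the batch system less often (seconds); raise further if the CE stays overloaded
      INFN_JOB_POLL_INTERVAL = 360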

* 2007-07-01: on lxb6115, changed all the log-level parameters to 0 and silent_logging to "yes", because the lcas and lcmaps logs hit the 2 GB file size limit very easily.

* 2007-07-18: On the WMS, GRIDMANAGER_MAX_PENDING_REQUESTS was changed from 1000 to 20 and CONDOR_JOB_POLL_INTERVAL was increased from 10 to 300, to reduce the load on the schedd on the CE. GRIDMANAGER_GAHP_CALL_TIMEOUT=900 was commented out, since if the stage-out hits that timeout the job ad gets corrupted; the default value is 8 hours (see the sketch below).
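
  A sketch of the corresponding WMS-side fragment of condor_config.local (values from this entry; the commented-out line shows the setting that was disabled so that the 8-hour default applies):

      # Limit outstanding requests to reduce the load on the CE schedd
      GRIDMANAGER_MAX_PENDING_REQUESTS = 20
      # Poll remote Condor-C jobs every 300 s instead of every 10 s
      CONDOR_JOB_POLL_INTERVAL = 300
      # Disabled: a 900 s GAHP call timeout corrupted the job ad when stage-out hit it
      #GRIDMANAGER_GAHP_CALL_TIMEOUT = 900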

* Notice: For the moment Condor can be downloaded from http://lxb2042.cern.ch/gLite/APT/R3.1-RB-pretest/RPMS/externals/condor-6.8.5-1.i386.rpm; the latest blahp in our repository is http://lxb2042.cern.ch/gLite/APT/R3.0-cert/rhel30/RPMS.patch1173.uncertified/glite-ce-blahp-1.5.22-1.i386.rpm. On the batch-system server node, /opt/glite/etc/blparser.conf should be created from the template in the same location, and the BLParser server started with "/opt/glite/etc/init.d/glite-ce-blparser start" (see the sketch below).
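
  A hedged sketch of the BLParser setup on the batch-system server node (RPM URL as quoted above; the template filename is illustrative, check the actual name shipped in /opt/glite/etc):

      # Install the blahp that provides the BLParser
      rpm -Uvh http://lxb2042.cern.ch/gLite/APT/R3.0-cert/rhel30/RPMS.patch1173.uncertified/glite-ce-blahp-1.5.22-1.i386.rpm

      # Create blparser.conf from the template in the same location, then adjust it for the local batch system
      cp /opt/glite/etc/blparser.conf.template /opt/glite/etc/blparser.conf

      # Start the BLParser server
      /opt/glite/etc/init.d/glite-ce-blparser start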

Basic tests:

* 2007-02-09: 1) 300 long-running jobs to verify that the bug with the 100-job maximum limit on the gLite CE is solved. 2) 1000 "Hello World" jobs in a single stream to the same gLite CE finished without any problem.

* 2007-02-07: After rebooting gLite CE, if there were condor instances running for users before rebooting, new condor instances could not be started.

* 2007-02-12: On one of the WMSs (lxb0744), suddenly all jobs submitted to all gLite CEs failed; however, it recovered by itself the next day. We suspect a problem with one condor launcher job that blocked the others; when the proxy of that job expired, the job was removed. We will try to reproduce the problem.

* 2007-02-27: When a job is running on the LRMS inside the gLiteCE and is cancelled by the user from the UI, it gets stuck. The issue is described in bug #24209 (http://savannah.cern.ch/bugs/?23779). If all the slots managed by the CE get a stuck job, the CE continues to accept new jobs but is unable to execute any of them, causing major damage.

Stress tests:

* 2007-02-07:

-> Gradual stress test at ctb-ce-1 (virtual machine, 512 MB RAM, 512 MB swap, Condor 6.8.3)

Every 5 minutes, a bunch of 5 jobs was submitted directly from the wms via condor_submit. Each job emulated the needs of a different user, forcing the gliteCE to create a new scheduler to handle every job.
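
For context, a rough sketch of what such a direct Condor-C submission from the WMS could look like; the hostnames, executable, proxy path and job count here are placeholders, not the actual test script:

    # sleep.submit -- hypothetical Condor-C submit file, sent with: condor_submit sleep.submit
    universe      = grid
    # remote schedd and collector on the gLite CE (placeholder hostnames)
    grid_resource = condor lxb0743.cern.ch lxb0743.cern.ch
    executable    = sleep_job.sh
    output        = job.$(Cluster).$(Process).out
    error         = job.$(Cluster).$(Process).err
    log           = job.log
    # submitting with a different user proxy forces a new per-user scheduler on the CE
    x509userproxy = /tmp/x509up_u500
    queue 5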

Total number of submitted jobs: 250 - Completed: 210; Hold (QDate + 900' evaluated to true): 30; Hold (Unknown): 10.

System overload made the machine unresponsive.

* 2007-02-08:

-> Gradual stress test with extended interval time at ctb-ce-1 (virtual machine, 512 MB RAM, 512 MB swap, Condor 6.8.3)

Every 10 minutes, a bunch of 5 jobs was submitted directly from the wms via condor_submit. Each job emulated the needs of a different user, forcing the gliteCE to create a new scheduler to handle every job.

Total number of submitted jobs: 300 - Completed: 220; Hold (QDate + 900' evaluated to true): 46; Hold (Attempts to submit failed): 11; Hold (Failed to authenticate using FS): 2; Hold (Failed to get expiration time of proxy): 3; Hold (Error): 5; Permanent down Grid Resource: 13.

System overload made the machine unresponsive again.

* The bad results forced the move to a non-virtual machine!

-> Pure stress test at lxb0743 (real machine, 512 MB RAM, 1024 MB swap, Condor 6.8.4)

A bunch of 100 jobs was submitted directly from the wms via condor_submit. Each job emulated the needs of a different user, forcing the gliteCE to create a new scheduler to handle every job.

Total number of submitted jobs: 100 - Completed: 71; Hold (QDate + 900' evaluated to true): 28; Hold (Attempts to submit failed): 1;

* 2007-02-15: 4000 jobs to the same gLite CE, lxb0743, through two WMSs (one with Condor 6.8.4 and one with Condor 6.7.19) as two VOs; thus there were 4 condor instances on the CE, with 1000 jobs per instance at the same time.

    • 3869 jobs finished successfully.
    • Most of the failed jobs have log info like "Got a job held event, reason: Repeated submit attempts (GAHP reports:)" from the logmonitor
    • After testing, there were 220 jobs left in torque queue in "W" status: from the email sent by torque, it is because of "Unable to copy file StandardOutput" and "Unable to copy file StandardError", see bug 20910 (https://savannah.cern.ch/bugs/?20910)

* 2007-02-15: 3000 jobs to the same gLite CE, lxb0743, through WMS lxb0744 (Condor version 6.8.4). The jobs were distributed over 30 dteam test users, 100 jobs each. A single "traffic_simulator" script was used, which let every user submit one job per iteration (in random order). All 30 required condor schedulers were created at the beginning of the test. With no problems there, the rest of the test carried on smoothly, finishing with 3000 "Done" jobs.

* 2007-02-23: 2250 jobs to the same glite CE, lxb0743 through WMS lxb0744 (condor version 6.8.4).

    • 2118 jobs finished successfully
    • 141 jobs got aborted
    • 1 job remains in waiting state
Brief description: 15 test users began to submit at point 0 and each submitted 50 jobs (a total of 750) until critical point 1. At this point the gLiteCE was already heavily loaded with jobs... Another 15 users then also began to submit jobs (forcing the creation of condorc processes on the loaded CE). The CE handled the task and the success rate was holding at 100%. Then critical point 2 was introduced: all condorc-launcher-starter entries were removed from the condor queue on the WMS. The service saw that they were missing and resubmitted them. Some had no problems (several users ended with a 100% success rate), but for others something went wrong. Most users lost 1-2 processes around this time, but for 3 users the job execution was finished... The logging info tells us that:
    • 96 jobs - Got a job held event, reason: Failed to get expiration time of proxy
    • 45 jobs - Got a job held event, reason: Error
The "failed to get expiration time of proxy" error messages were reported as bug #24151 (https://savannah.cern.ch/bugs/?24151).

* 2007-02-27:

-> Pure stress test at lxb0743 (real machine, 512 MB RAM, 1024 MB swap, Condor 6.8.4)

A bunch of 400 jobs was submitted directly from the wms via condor_submit. Each job emulated the needs of a different user, forcing the gliteCE to create a new scheduler to handle every job. All jobs were submitted at critical point "time 0".

Although only a small percentage of jobs completed, all the others went on hold due to excessive time spent in the condor queue waiting to be delivered to the CE. The CE did not accept more than it was capable of handling. Unlike the virtual machine (ctb-ce-1), it did not become unresponsive; from that point of view it behaved quite well. The fact that lxb0743 does not run a torque server is a good explanation, suggesting that separate gLiteCE and torque server nodes are a more robust solution.

Total number of submitted jobs: 400 - Completed: 79; Hold (QDate + 900' evaluated to true): 321;

* 2007-03-02:

-> Continuous stress test at lxb0743 (real machine, 512 MB RAM, 1024 MB swap, Condor 6.8.4)

The test was made using 4 test users. Each one submitted 100 jobs every 30 minutes (a total rate of 800/hour). A total of 10067 jobs were submitted; unfortunately the test was aborted due to an ssh timeout to the UI, although the primary objective had been to submit 48000 jobs over the weekend. The final results are therefore not an "endurance" test but a typical continuous stress test, with the following results:

Total number of submitted jobs: 10067 - Completed: 10066; Failed (Removal retries exceeded): 1;

Logging information for the failed job was the following: "LM message: the timeout attached to the globus-down event expired"

* 2007-03-05: 10000 jobs in 10 dags (1000 jobs per dag) through lxb7283 to lxb0743 with ATLAS (6 dags) and DTEAM (4 dags); each job slept 300 seconds and the test ran for 2 days. The failure rate was 0.11%:

  • 1 job failed because of "Got a job held event, reason: Error" in the gLite CE condor queue; it is difficult to figure out what the Error is
  • 10 jobs failed because of "Cannot take token". From the log info these jobs executed twice on different WNs, so we suspect that blahp submitted a job to the batch system but did not receive the successful-submission signal (e.g., blah could not get the jobid) and then submitted it again (caused by the CE). When the first submitted job ran, it removed the token (it actually finished successfully), so the second submission could not take the token and logged such an event into LB.
  • Because the maximum planner number is limited to 8 on lxb7283, passing jobs to the CE is slow; the maximum number of jobs in torque was around 160.

* 2007-03-07: 6000 jobs in 6 dags (1000 jobs per dag) through lxb7283 to lxb0743 with ATLAS and DTEAM; each job slept 600 seconds and the test ran for 1 day.

  • For most of the test there were more than 2000 jobs in the queue; at the peak it reached 3874.
  • The CPU load was between 3 and 6 during the test, sometimes reaching 9.
  • Only 3793 (63.22%) succeeded.
  • Many jobs failed due to "Cannot take token", "Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona." or "Got a job held event, reason: Repeated submit attempts (GAHP reports:)", and these jobs had been executed two or three times due to the resubmission.

* 2007-03-11: 6000 jobs in 6 dags (1000 jobs per dag) through lxb7283 to lxb0743 with ATLAS and the ATLAS SGM Role; each job slept 600 seconds and the test ran for 1 day.

  • For most of the test there were more than 2000 jobs in the queue; at the peak it reached 3219.
  • The CPU load was between 3 and 7 during the test, sometimes reaching 9.
  • Only 4056 (67.6%) succeeded.
  • The major failure reasons are the same as in the test on 7 March.

* 2007-03-12: The same 6000-job test, but since pbs_mom was quite unstable after increasing the virtual CPU count to 30 on each WN, the test results are almost meaningless.

* 2007-03-13: The same 6000-job test, but since the condorc instances on the gLite CE crashed and could not be relaunched, the test results are again almost meaningless.

* 2007-04-10: 4000 jobs, each sleeping 900 s. Before starting the test, the queues were filled with 1-hour jobs so that the test jobs would accumulate in the queue. Even with that many jobs accumulated in the queue, only 12 jobs aborted: 11 failed with "Repeated submit attempts (GAHP reports:)" and one failed with "Attempts to submit failed". Debug info shows that blah's pbs submission script successfully submitted these jobs, but Condor's gridmanager did not get the information.

* 2007-04-11: 2000 1-hour sleep jobs and 4000 900-second sleep jobs. 5942 finished successfully, 33 jobs ended in Done (Failed) status and 25 jobs in Aborted status. The reasons are "Repeated submit attempts (GAHP reports:)", "Attempts to submit failed" and "Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona". All Done (Failed) jobs have log info like "Got a job held event, reason: Error". My checkpoints for pbs_submit.sh and pbs_status.sh show that both scripts worked fine.

* 2007-04-12: 4000 3000-second sleep jobs submitted in two roles. 3670 (91.75%) jobs finished successfully. The maximum number of jobs in the queue was 2692; for most of the time there were more than 1000 jobs in the queue, and the CPU load on the CE went as high as 8. The job failure reasons are still the same as on previous days. The reason for "Repeated submit attempts (GAHP reports:)" was also found: it is a bug inside Condor, namely that network timeouts were not checked while the condor_status_constrained command was executed. Of more than 4000 qsub events from pbs_submit.sh, one submission failed, probably because the machine was too busy.

* 2007-04-13: 4000 3000-second sleep jobs submitted in two roles. 2978 jobs finished successfully, 199 jobs ended in Done (Failed) status and 823 jobs were aborted. However, by mistake some jobs were submitted with a short proxy lifetime (12 h), so 807 jobs failed because of this even though the WMS renewed the proxy and relaunched the Condor-C instance on the CE side; the failure reason is "Removal retries exceeded", and the log info shows "LM message: the timeout attached to the globus-down event expired." This is another good reason for us to move to VO-based Condor instances. In this test, C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER was increased from 10 to 20 on the WMS and SCHEDD_TIMEOUT_MULTIPLIER = 5 was set on the CE. Subtracting the jobs that failed due to the expired proxy, we get a 93.1% success rate. But there is still a large fraction of jobs failing with "Done (Failed)" where the log info shows only "Got a job held event, reason: Error".

* 2007-04-14: 6000 2400-second sleep jobs submitted in two roles. 5470 (91.17%) jobs finished successfully, 467 jobs ended in Done (Failed) status and 63 jobs were aborted; of those, 61 failed due to "Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona", also showed "Cannot take token" and were all resubmitted, although my debug info shows that pbs_submit.sh finished correctly. Apparently increasing the schedd timeout on the CE helps a lot, since no jobs were aborted due to a Condor schedd timeout. One job had problems downloading the executable, probably due to a corrupted proxy. For the jobs in Done (Failed) status, take https://lxb7283.cern.ch:9000/-4S2W3IXAv9S76Ez5uYlew as an example: the pbs job id is 210609.lxb2035.cern.ch. In the condor job log on the CE it was given an error, "Error", at 02:26:32, while the pbs server log shows that this job started running at 01:22:04 and finished at 02:02:27. However, there was a "DeleteJob" for this job at 02:26:28, and pbs gave the error message "Unknown Job Id", since the job had already been dequeued after finishing. From the timestamps, this should be responsible for the "Error" message in the condor logs. In the condor log there is first a message "JobStatus 5", which should mean the job is on hold, and then it tried to cancel the job.

* 2007-04-20: 10000 jobs submitted (with 3 different sleep times: 30 minutes, 1 hour, 2 hours). The submission was made in collections of 1000/2000 jobs each.

The first set (all 30-minute jobs) had relatively good results (99.4%). Only 6 jobs failed, all due to errors of the type: "Cannot upload waiting.err into gsiftp://lxb7283.cern.ch:2811/var/glite/SandboxDir/Af/https_3a_2f_2flxb7283.cern.ch_3a9000_2fAfi2hHrRzbui3PANIu59sw/output/waiting.err"

The following sets were affected by a couple of new issues: a bug leading to a SEGV of the worker thread, and a bug producing corrupted filenames, with which a file transfer can fail due to a bad hash function returning a negative result. Both are currently under investigation and may be solved in a new condor release (Wisconsin is working on the condor code).

Results of the following sets on day1 testing (with C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER = 30):

Set 2 ( 1000 jobs of 1 hour) Results: Job got an error while in the CondorG queue. - 39; hit job retry count (0) - 22; Job terminated successfully - 939

Set 3 ( 1000 jobs of 30 minutes) Results: Job got an error while in the CondorG queue. - 171; Job terminated successfully - 829

Set 4 ( 1000 jobs of 1 hour) Results: hit job retry count (0) - 2; Job terminated successfully - 998

Set 5 ( 1000 jobs of 30 minutes) Results: hit job retry count (0) - 9; Job terminated successfully - 991

Set 6 ( 1000 jobs of 1 hour) Results: Job got an error while in the CondorG queue. - 2; hit job retry count (0) - 3; Job terminated successfully - 995

Set 7 ( 1000 jobs of 30 minutes) Results: Job got an error while in the CondorG queue. - 2; hit job retry count (0) - 9; Job terminated successfully - 989

Set 8 ( 1000 jobs of 1 hour) Results: Job got an error while in the CondorG queue. - 3; hit job retry count (0) - 3; Job terminated successfully - 994

Set 9 ( 2000 jobs of 2 hours) Results: hit job retry count (0) - 38; Job terminated successfully - 1962

Conclusions: After the CondorC queue on the gLiteCE recovered from the blocked state (the jobs were stopped for about one day), the success rate went back to acceptable levels. The duration of the jobs does not appear to be a problem anymore (WNs can receive incoming connections), and therefore the longer jobs are not affected by this issue. Even considering the problem that affected the queue in the early stage of the test, a final success rate of 96.91% was achieved. Since the test also served to uncover 2 new bugs, we must consider these results very good.

* 2007-04-25: 4000 jobs submitted (with 2 different sleep times: 30 minutes, 1 hour). Note: this was a continuation of the previous test (which took more time than expected because it was blocked for some time).

Set 1 ( 1000 jobs of 30 minutes) Results: Job got an error while in the CondorG queue. - 35; hit job retry count (0) - 11; Job terminated successfully - 954

Set 2 ( 1000 jobs of 1 hour) Results: Job got an error while in the CondorG queue. - 2; hit job retry count (0) - 5; Job terminated successfully - 993

Set 3 ( 1000 jobs of 30 minutes) Results: Job got an error while in the CondorG queue. - 37; hit job retry count (0) - 2; Job terminated successfully - 961

Set 4 ( 1000 jobs of 1 hour) Results: Job terminated successfully - 1000

Conclusions: As the results appeared to be getting worse (set 1), it was decided to restart the stress test and avoid any influence of the problems detected during the day-1 testing. The success rate for these 4000 jobs was 97.7%. The final set, which in principle was executed when the queue was almost empty, got a perfect 100%.

* 2007-04-26: 6000 jobs submitted (of 3 different sleep times: 30 minutes, 1 hour, 2 hours)

Set 1 ( 1000 jobs of 30 minutes) Results: Job terminated successfully - 1000

Set 2 ( 1000 jobs of 1 hour) Results: Job terminated successfully - 1000

Set 3 ( 1000 jobs of 2 hours) Results: Removal retries exceeded. - 12; Job terminated successfully - 988

-> After these 3 sets most of the jobs failed due to "Removal retries exceeded" (LM message: the timeout attached to the globus-down event expired), i.e. problems with the remote condor queue.

* 2007-05-25: 6000 2400-second sleep jobs submitted in 6 dags within two roles. 5798 jobs are in "Done (success)" (96.65%), 38 jobs are in "Done (Failed)" status, and 164 jobs were aborted. This test was done with Condor 6.8.5. Before the jobs were submitted, 190 1-hour running jobs were submitted to fill the queue. During the test, the gLite CE had more than 2000 jobs in the queue most of the time (it even reached 4349), and the CPU load was quite high, often above 20 and sometimes reaching 40.

Stress tests with glexec:

* 2007-07-01: 5000 2200-second sleep jobs submitted in 5 dags to the glexec CE with a single user. Because the lcas and lcmaps logs reached the 2 GB file size limit, blah stopped working and the test was cancelled.

* 2007-07-02: 4000 2200-second sleep jobs submitted in 4 dags as a single user. Three of them got stuck due to the testbed upgrade. The success rate for those that finished is 99%.

* 2007-07-03: 4000 2200-second sleep jobs submitted in 4 dags as a single user. 3943 jobs (98.575%) finished successfully. 43 jobs failed because of the error "Error connecting to schedd atlas@lxb6115.cern.ch", 8 jobs failed because of "Attempts to submit failed", and 2 jobs failed for unclear reasons, since condor only gave "Error".

* 2007-07-08: 6000 2200-second sleep jobs submitted in 6 dags as a single user. 5993 jobs (99.88%) finished successfully. The 7 failed jobs all failed due to "Error connecting to schedd atlas@lxb6115.cern.ch".

* 2007-07-09: 6000 2200-second sleep jobs submitted in 6 dags as a single user. 5982 jobs (99.7%) finished successfully. The 18 failed jobs all failed due to "Attempts to submit failed".

* 2007-07-10: 10000 1100-second sleep jobs submitted in 10 dags as a single user. The test was aborted because certificates were revoked.

* 2007-07-11: Multiple users test ( 40 users - 20 jobs each , 30 minutes interval, and again 40 users - 20 jobs each). Machines used: lxb7283 - WMS, lxb7026 - LB, lxb6115 - glexec gCE on SLC3

Total: 1600 jobs - 990 terminated successfully (61.875%)

* 2007-07-17: Multiple users test ( 40 users - 100 jobs each, each job sleeping 2200 seconds). 2740 (68.5%) terminated successfully. 767 - Cancellation command failed; 465 - Error connecting to schedd dteam@lxb6115.cern.ch; 28 - Error locating schedd dteam@lxb6115.cern.ch.

* 2007-07-18: Multiple users test ( 40 users - 50 "Hello World" jobs each). 2000 (100%) terminated successfully.

* 2007-07-19: Multiple users test ( 40 users - 150 jobs each, each job sleeping 2200 seconds). 3010 terminated successfully; most of the other jobs failed due to the 12-hour limit on the voms extension lifetime. This was caused by using the "-hours" option to create a long voms proxy: although the grid proxy then has a long lifetime, the voms extension lifetime is still limited to 12 hours. The solution is to use the "-valid" option (see the example below).
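
A minimal illustration of the difference, assuming the dteam VO and a 48-hour proxy (VO name and lifetime are placeholders):

    # Only the grid proxy gets 48 hours; the VOMS extension still defaults to 12 hours
    voms-proxy-init -voms dteam -hours 48

    # Request 48 hours for both the grid proxy and the VOMS extension
    # (still capped by the VOMS server's maximum AC lifetime)
    voms-proxy-init -voms dteam -valid 48:00

    # Check the remaining lifetime of the proxy and of the AC
    voms-proxy-info -timeleft -actimeleft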

* 2007-07-20: Multiple users test ( 40 users, 2000-second sleep jobs), 7975 submitted; 75 jobs failed due to my mistake. Of the remaining 7900 jobs, 7816 (98.94%) terminated successfully. Of the 84 failing jobs, 68 failed because of "Cannot download test.sh from gsiftp....", 7 failed due to "Error connecting to schedd dteam@lxb6115.cern.ch", 8 failed because of "Attempts to submit failed", and one job could not be executed.

* 2007-07-19: Multiple users test ( 40 users, 2000-second sleep jobs), 7975 submitted, plus 2000 single-user jobs of another separate VO (sleeping 1100 seconds). 9719 (97.43%) jobs terminated successfully. 15 jobs failed due to "Error connecting to schedd ...", 238 jobs failed due to "Cannot read JobWrapper output, both from Condor and from Maradona", and 3 jobs failed due to "Cannot download ... from gsiftp...". We suspect that the large number of failures is caused by the short lifetime of the limited proxy.

* 2007-07-23: Multiple users test ( 40 users, 2000-second sleep jobs), 7900 jobs submitted, plus 2000 separate single-user jobs (sleeping 1100 seconds). 9889 (~99.89%) terminated successfully. 11 jobs failed due to "Attempts to submit failed" and 2 jobs failed due to "Got a job held event, reason: Error".

* 2007-07-25: Multiple users test ( 40 users, 2000-second sleep jobs), 7950 jobs submitted. 7806 (98.19%) terminated successfully. 122 failed due to "Error connecting to schedd.." and 18 failed due to "Got a job held event, reason: Error". We suspect that most of the failures were caused by an unstable network connection, since the gridmanager log showed "GAHP[12717] (stderr) -> BLClient: Invalid hostname", which can happen when BLClient cannot resolve the BLParser server hostname.

* 2007-07-24: Multiple users test ( 40 users, 2000-second sleep jobs), 8000 jobs submitted. 8000 (100%) terminated successfully. In this test, C_GAHP_WORKER_THREAD_TIMEOUT_MULTIPLIER was increased from 5 to 7 on the WMS, and GRIDMANAGER_TIMEOUT_MULTIPLIER was increased from 5 to 7 as well. In addition, the CE IP and hostname were added to /etc/hosts on the WMS; on the CE, the BLParser server IP address is used in blah.config, and the IP and hostname of the BLParser server were added to /etc/hosts as well.

* 2007-07-25: Multiple users test ( 40 users, 2000-second sleep jobs), 8000 jobs submitted. 8000 (100%) terminated successfully. In this test, GRIDMANAGER_MAX_PENDING_REQUESTS was increased from 20 to 200 to check whether that is compatible with LCG CE submission (originally it was 1000).

* 2007-07-26: Multiple users test ( 40 users, 2000-second sleep jobs) with the SL4 glexec CE, 7600 jobs submitted. 7584 (99.79%) jobs terminated successfully. 16 jobs failed due to "Error connecting to sched". Since this is a new CE, its hostname had not been added to /etc/hosts on the WMS, so it looks like unstable DNS really can affect condor.

* 2007-07-30: Multiple users test ( 40 users, 2000-second sleep jobs) with the SL4 glexec CE, 8000 jobs submitted, 7997 (~99.96%) terminated successfully. One failed due to "Error connecting to schedd", one failed due to "Attempts to submit failed", and one failed due to "Cannot read JobWrapper output, both from Condor and from Maradona".

-- Main.nvazdasi - 12 Jul 2007
