For updates to this page, see CreamTests2011.

Configuration changes

  • 2008-07-01: set lcmaps_debug_level = 0, lcmaps_log_level = 1, lcas_debug_level = 0, lcas_log_level = 1 in /opt/glite/etc/glexec.conf
  • 2008-07-08: Update yaim cream-ce to 4.0.4-10
  • 2008-07-09: Update yaim cream-ce to 4.0.4-12
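
The glexec.conf change from 2008-07-01 can be checked with a small script. This is only a sketch: it assumes the file consists of plain `key = value` lines, and the helper names `parse_key_values` and `check_log_levels` are hypothetical, not part of glexec. Only the four keys and their values come from the entry above.

```python
# Expected levels, taken from the 2008-07-01 configuration change.
EXPECTED = {
    "lcmaps_debug_level": "0",
    "lcmaps_log_level": "1",
    "lcas_debug_level": "0",
    "lcas_log_level": "1",
}


def parse_key_values(text):
    """Parse 'key = value' lines; skip blanks, comments and section headers.

    Assumption: the configuration file uses simple key = value lines.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        if "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def check_log_levels(text):
    """Return {key: (expected, found)} for every key that differs or is missing."""
    found = parse_key_values(text)
    return {k: (v, found.get(k)) for k, v in EXPECTED.items() if found.get(k) != v}


# Illustrative input only; on the CE the text would be read from
# /opt/glite/etc/glexec.conf instead.
sample = """
lcmaps_debug_level = 0
lcmaps_log_level = 1
lcas_debug_level = 0
lcas_log_level = 5
"""
for key, (want, got) in check_log_levels(sample).items():
    print(f"{key}: expected {want}, found {got}")
```

An empty result means all four levels match the values listed above.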

Issues

  • Installation
    • What should be done about mysql-connector-java, tomcat5 and mysql-server?

  • Configuration
    • BATCH_CONF_DIR is needed only for LSF; it is not needed for Torque.
    • CEMON_HOST and ACCESS_BY_DOMAIN need to be provided by default, in site-info.def or elsewhere.
    • config_cream_blparser is for the BLParser server. It should not be included in glite-creamce because in principle the CREAM CE is just a submitter. For a Torque server we should probably include it there; for other batch systems it may have to be installed and configured by hand.
    • After a new CA is installed we have to rerun the configuration. Why? Do we need to rerun the configuration every time a CA is upgraded?
    • The permissions on /opt/glite/bin/glexec-wrapper.sh seem wrong; we have to run chmod a+x on it. (Invalid)
    • pbs_BLPserver always points to 127.0.0.1, which is wrong if the BLParser server is on a different machine.
    • The /opt/log4cpp/lib path needs to be added to LD_LIBRARY_PATH on the UI. (done)
    • Configuration change for the new glexec.
    • CREAM_DB_USER needs to be set (as of yaim cream-ce 4.0.4-10).

  • glexec, lcas and lcmaps are too verbose, i.e. they generate too many log entries.
  • Too many files are left under the home directories of the pool accounts.
  • Is there any cleanup for the CREAM sandbox?
    • If a job is submitted with the CREAM CLI, the purge has to be done by the user.
    • If a job is submitted by ICE, it is purged automatically.
  • Need to investigate whether an existing delegated proxy ID can be reused: "Description=[delegation error: the proxy delegationID "testcream2" is not more valid!;] FaultCause=[delegation proxy expired!]"
  • After reconfiguring the CREAM CE (yaim cream-ce 4.0.4-12), job submission to the CREAM CE failed for the first few minutes with "FATAL - Connection to service [https://lxb7347.cern.ch:8443/ce-cream/services/gridsite-delegation] failed. Check URL"
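
For the files left under the pool-account home directories, listing them is a cautious first step before deleting anything. The sketch below assumes the leftovers follow the cream_*.o and cream_*.e naming seen in the stress tests and that each pool account has its home directly under one root directory (for example /home, as in the /home/dteam025 paths in the error messages). The function name and the age threshold are hypothetical.

```python
import glob
import os
import time


def find_leftover_files(home_root, min_age_seconds, now=None):
    """List cream_*.o / cream_*.e files at least min_age_seconds old.

    Only one directory level below home_root is searched:
    <home_root>/<account>/cream_*.o and <home_root>/<account>/cream_*.e
    """
    now = time.time() if now is None else now
    leftovers = []
    for pattern in ("cream_*.o", "cream_*.e"):
        for path in glob.glob(os.path.join(home_root, "*", pattern)):
            if os.path.isfile(path) and now - os.path.getmtime(path) >= min_age_seconds:
                leftovers.append(path)
    return sorted(leftovers)


if __name__ == "__main__":
    # Report (do not delete) files older than one week; the threshold is arbitrary.
    for path in find_leftover_files("/home", min_age_seconds=7 * 24 * 3600):
        print(path)
```

The script only reports candidates; whether they can be removed safely depends on whether the corresponding jobs have finished.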

Stress test

  • 2008-06-16: 4000 2000s-sleep jobs with a single proxy in 10 threads
    • 3999 jobs finished successfully
    • 1 job stayed in the queue forever; Torque had problems staging in the executable, suspected to be caused by a race condition
    • a lot of files were left under the pool account home directories; the files are named cream_*.o and cream_*.e and are the stdout and stderr of the batch system.

  • 2008-06-17: 10000 1000s-sleep jobs with a single proxy in 10 threads
    • 9058 jobs finished successfully
    • 891 jobs in "DONE-FAILED" status with "pbs_reason=1"
      • checking some of the failed jobs showed that the "StandardError" file of every job checked complained: "connect: Connection refused at -e line 23. connect: Connection refused at -e line 23. Cannot move ISB (${globus_transfer_cmd} gsiftp://lxb7347.cern.ch/opt/glite/var/cream_sandbox/dteam/C_CH_O_CERN_OU_GD_CN_Test_user_9_dteam_Role_NULL_Capability_NULL/CREAM026822414/ISB/test.sh file:///home/dteam025/home_cream_026822414/CREAM026822414/test.sh):"
    • 51 jobs aborted with "BLAH error: + submission command failed (exit code = 1) N/A"

  • 2008-06-20: 10160 1000s-sleep jobs with 49 proxies in 2 threads, with a 30s sleep after every 20 jobs
    • 9819 (96.64%) jobs finished successfully
    • 341 jobs aborted with "BLAH error: + submission command failed (exit code = 1) N/A"

  • 2008-06-21: 9800 1000s-sleep jobs with 49 proxies in 4 threads, with a 30s sleep after every 20 jobs
    • 9798 (99.98%) jobs finished successfully
    • 2 jobs aborted with "BLAH error: + submission command failed (exit code = 1) N/A"

  • 2008-06-23: 9800 1000s-sleep jobs with 49 proxies in 8 threads, with a 15s sleep after every 40 jobs.
    • the submission aborted midway because the root partition filled up; glexec produced too many logs.

  • 2008-06-24: 9800 900s-sleep jobs with 49 proxies in 8 threads, with a 20s sleep after every 40 jobs
    • 9800 (100%) jobs finished successfully
    • While jobs were being submitted the CPU load was around 3-4, sometimes reaching 6; once submission finished the load dropped below 1.

  • 2008-06-25: 9800 1000s-sleep jobs with 49 proxies in 8 threads, with a 15s sleep after every 40 jobs.
    • 9752 (99.51%) jobs finished successfully
    • 46 jobs aborted with "BLAH error: + submission command failed (exit code = 1) N/A"
    • 2 jobs failed with "pbs_reason=1; connect: Connection refused at -e line 23. connect: Connection refused at -e line 23. Cannot move ISB (${globus_transfer_cmd} gsiftp://lxb7347.cern.ch/opt/glite/var/cream_sandbox/dteam/C_CH_O_CERN_OU_GD_CN_Test_user_6_dteam_Role_NULL_Capability_NULL/CREAM086441101/ISB/test.sh file:///home/dteam045/home_cream_086441101/CREAM086441101/test.sh"

  • 2008-06-26: The submission aborted because the VOMS server host certificate had been updated but not copied to the CE. The tests of the following few days aborted for the same reason, since they ran over the weekend.

  • 2008-07-01: 9800 1000s-sleep jobs with 49 proxies in 8 threads, with a 15s sleep after every 40 jobs.
    • 9775 (99.74%) jobs finished successfully
    • 24 jobs aborted with "[BLAH error: + submission command failed (exit code = 1) N/A (jobId = CREAM981004101)]"
    • 1 job stayed in running status for a long time (CREAM job id https://lxb7347.cern.ch:8443/CREAM034197757, pbs job id "189475.lxb2035.cern.ch"). The Torque server log contains the messages "Batch protocol error (15031) in send_job, child failed in previous commit request for job 189475.lxb2035.cern.ch" and "unable to run job, MOM rejected/rc=1", and later "Req;req_reject;Reject reply code=15057(Cannot execute at specified host because of checkpoint or stagein files MSG=allocated nodes must match input file stagein location), aux=0, type=RunJob, from root@lxb2035.cern.ch", so it looks like the stage-in files were missing (maybe removed) after the first submission failed.

  • 2008-07-02: 9800 1000s-sleep jobs with 49 proxies in 8 threads, with a 15s sleep after every 40 jobs.
    • 9778 (99.78%) jobs finished successfully
    • 22 jobs aborted with "[BLAH error: + submission command failed (exit code = 1) N/A (jobId = CREAM981004101)]"

  • 2008-07-03: 9800 1000s-sleep jobs with 49 proxies in 8 threads, with a 15s sleep after every 40 jobs.
    • 9560 (97.55%) successfully submitted
      • 8959 (8959/9560 ≈ 93.71%) finished successfully
      • 600 jobs aborted with "BLAH error: + submission command failed (exit code = 1)"
      • 1 job stayed in Running status indefinitely
    • 240 jobs failed to be submitted with "Received NULL fault; the error is due to another cause: : FaultString=[User C=CH,O=CERN,OU=GD,CN=Test user 3 not authorized for {http://www.gridsite.org/namespaces/delegation-2}getProxyReq] - FaultCode=[SOAP-ENV:Server.generalException]" or "Received NULL fault; the error is due to another cause: : FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException]"

  • 2008-07-04: 9800 1000s-sleep jobs with 49 proxies in 8 threads, with a 15s sleep after every 40 jobs.
    • 9480 (96.73%) successfully submitted
      • 9479 (~100%) finished successfully
      • 1 aborted with "BLAH error: + submission command failed (exit code = 1)"
    • 320 jobs failed with "Received NULL fault; the error is due to another cause: : FaultString=[User C=CH,O=CERN,OU=GD,CN=Test user 3 not authorized for {http://www.gridsite.org/namespaces/delegation-2}getProxyReq] - FaultCode=[SOAP-ENV:Server.generalException]" or "Received NULL fault; the error is due to another cause: : FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException]"

  • 2008-07-09: 3920 800s-sleep jobs with 49 proxies in 8 threads, with a 15s sleep after every 40 jobs.
    • 3917 jobs finished successfully
    • 1 job aborted with "[BLAH error: + submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: Invalid credential MSG=cannot authenticate user-No Permission.-qsub: cannot connect to server lxb2035.cern.ch (errno=15007)-) N/A (jobId = CREAM239623634)"
    • 2 jobs in RUNNING status
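
The percentages quoted for the runs above follow directly from the job counts. The sketch below recomputes some of them as a cross-check; the `success_rate` helper is hypothetical, and the counts are copied from the entries above.

```python
def success_rate(succeeded, total):
    """Return the success percentage, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * succeeded / total, 2)


# (date, jobs finished successfully, jobs counted in the run)
# For 2008-07-03 the base is the 9560 jobs that were successfully submitted.
runs = [
    ("2008-06-20", 9819, 10160),
    ("2008-06-21", 9798, 9800),
    ("2008-06-25", 9752, 9800),
    ("2008-07-01", 9775, 9800),
    ("2008-07-02", 9778, 9800),
    ("2008-07-03", 8959, 9560),
]

for date, ok, total in runs:
    print(f"{date}: {ok}/{total} = {success_rate(ok, total):.2f}%")
```

Keeping the raw counts next to the percentage makes it easy to see which base (all jobs, or only the submitted ones) a figure refers to.
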
-- DiQing - 20 Jun 2008
Topic revision: r15 - 2011-05-10 - unknown