CE Stress Testing
Overview
The original acceptance criteria for a CE implementation were outlined by the TCG in the following
document. A summary of the basic acceptance criteria is:
- Performance
- 5000 simultaneous jobs per CE node
- 50 user/role/submission node combinations supported on a single CE node
- Reliability
- Job failure rates in normal operations due to the CE <0.5%
- Job failures due to restart of CE services or reboot <0.5%
- 5 days unattended running with performance on day 5 equal to that on day 1
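The numeric criteria above can be checked mechanically once a test run has been summarised. The following is a minimal sketch, assuming a hypothetical `StressRun` record whose field names are illustrative and do not come from any of the test suites mentioned on this page. It does not cover the 5-day stability criterion, which needs measurements taken on separate days.

```python
from dataclasses import dataclass

# Thresholds taken from the acceptance criteria listed above.
MAX_FAILURE_RATE = 0.005       # job failures due to the CE must stay below 0.5%
MIN_SIMULTANEOUS_JOBS = 5000   # simultaneous jobs per CE node
MIN_USER_COMBINATIONS = 50     # user/role/submission node combinations


@dataclass
class StressRun:
    """Hypothetical summary of one stress-test run (illustrative fields)."""
    jobs_submitted: int
    jobs_failed_by_ce: int
    peak_simultaneous_jobs: int
    user_combinations: int

    @property
    def failure_rate(self) -> float:
        # Guard against division by zero for an empty run.
        if self.jobs_submitted == 0:
            return 0.0
        return self.jobs_failed_by_ce / self.jobs_submitted


def check_acceptance(run: StressRun) -> dict:
    """Return a pass/fail flag for each numeric criterion."""
    return {
        "simultaneous_jobs": run.peak_simultaneous_jobs >= MIN_SIMULTANEOUS_JOBS,
        "user_combinations": run.user_combinations >= MIN_USER_COMBINATIONS,
        "failure_rate": run.failure_rate < MAX_FAILURE_RATE,
    }
```

For example, `check_acceptance(StressRun(6000, 20, 5200, 50))` reports all three criteria as met, since 20 failures out of 6000 jobs is below 0.5%.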
The formal testing procedures for EGEE certification can be found here.
Initial Testing by Di Qing
The initial testing of various interfaces was overseen by Di Qing. The results of these tests can be found on the following pages.
In summary, there are two kinds of tests: how many unique users (DN/Group/Role combinations) the CE can handle, and how many jobs the CE can handle simultaneously for a given number of users. The testing scripts used to test the LCG-CE can be found in
/afs/cern.ch/user/d/dqing/public/multipleuser-andrey7. The submit.sh script shows how to use the utilities, for example:
"traffic-simulator -max-time 91 -proxy user_proxies/ -wms-list wms_list -users 20 30 75 0"
This submits jobs for 20 users at 30-minute intervals, with 75 jobs per user in each submission. The script exits when it reaches the max-time of 91 minutes, so there are 4 submissions (at 0, 30, 60 and 90 minutes) and 4*20*75=6000 jobs in total.
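The job count for other parameter choices can be worked out the same way. The sketch below assumes that the first submission happens at time 0 and that no submission starts once max-time has been reached; this matches the 4 submissions quoted for the example but has not been checked against the traffic-simulator source. The function names are illustrative.

```python
def submission_times(max_time_min: int, interval_min: int) -> list:
    """Minutes at which a submission round starts.

    Assumption: rounds start at t=0 and repeat every interval_min minutes,
    and no round starts at or after max_time_min.
    """
    return list(range(0, max_time_min, interval_min))


def total_jobs(users: int, interval_min: int, jobs_per_user: int,
               max_time_min: int) -> int:
    """Total jobs submitted over the whole run."""
    rounds = len(submission_times(max_time_min, interval_min))
    return rounds * users * jobs_per_user


# The example from the text: 20 users, 30-minute interval, 75 jobs, 91 minutes.
example_total = total_jobs(users=20, interval_min=30, jobs_per_user=75,
                           max_time_min=91)
```

With the example parameters, `submission_times(91, 30)` gives `[0, 30, 60, 90]` and `example_total` is 6000.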
For the CREAM CE, since it was not possible to submit jobs through the WMS, the test was done with scripts provided by INFN, which can be found in
/afs/cern.ch/user/d/dqing/public/cream.
LSF Testing Environment
The machine bes.cern.ch has been installed with LSF 7.0 by Chris Smith from Platform Computing. VMware has been used to set up a number of virtual machines that serve as Worker Nodes. The BES interface from LSF has been configured on this machine and can be used to submit jobs to the LSF batch system. The machine also hosts an LCG CE, a Cream CE and an ARC CE, all of which use the underlying LSF batch system. The machine vtb-generic-52 has been installed with a version of the WMS which can submit to all of these CEs, with the exception of the BES interface.
PNPI Testing
A testing framework has been written which can be used to stress test CEs. More details on the test suite and results from testing can be found here and here.
Test Plan
- Testing of CE performance via WMS submission
- Testing that the Cream CE meets the official acceptance criteria
- Testing of CE performance via direct submission
- Testing the performance of the Condor-G submission to the Cream CE
- Testing the performance of ICE submission to many Cream CEs
Direct Submission
LSF
Installation
mount lxbra1908.cern.ch:/share /share
. /share/lsf/conf/profile.lsf
cd /share/lsf/7.0/linux2.6-glibc2.3-x86/bin
LSF Direct Job Submission
bsub -m bes.cern.ch -o std-1.out -e std-1.err ./testjob.sh
LSF Status Query
bjobs -m bes.cern.ch -u all -a
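For a stress test, many such submissions are needed, each with its own output files. The sketch below only builds the command lines and does not run them. The per-job file naming (std-N.out, std-N.err) is an assumption extrapolated from the single example above, and the helper names are illustrative.

```python
import shlex


def bsub_command(index: int, host: str = "bes.cern.ch",
                 script: str = "./testjob.sh") -> list:
    """Argument list for one direct LSF submission.

    Uses only the bsub options shown above (-m, -o, -e); the numbered
    output file names are an assumed convention.
    """
    return ["bsub", "-m", host,
            "-o", f"std-{index}.out",
            "-e", f"std-{index}.err",
            script]


def bjobs_command(host: str = "bes.cern.ch") -> list:
    """Argument list for the status query shown above."""
    return ["bjobs", "-m", host, "-u", "all", "-a"]


def batch_commands(count: int) -> list:
    """Shell-quoted command lines for jobs 1..count."""
    return [shlex.join(bsub_command(i)) for i in range(1, count + 1)]
```

The lists returned by `bsub_command` can be passed directly to `subprocess.run` on a machine where the LSF profile has been sourced, while `batch_commands(100)` gives 100 lines that can be written to a shell script.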
BES LSF
Installation
mount lxbra1908.cern.ch:/share /share
cd /share/hpcp/
BES Job Submission
<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl">
  <JobDescription>
    <JobIdentification>
      <JobName>Sleep</JobName>
      <JobProject>BES</JobProject>
    </JobIdentification>
    <Application>
      <HPCProfileApplication
          xmlns="http://schemas.ggf.org/jsdl/2006/07/jsdl-hpcpa">
        <Executable>sleep</Executable>
        <Argument>300</Argument>
        <Output>/dev/null</Output>
        <WorkingDirectory>/tmp</WorkingDirectory>
      </HPCProfileApplication>
    </Application>
    <Resources>
      <TotalCPUCount>
        <Exact>1</Exact>
      </TotalCPUCount>
    </Resources>
  </JobDescription>
</JobDefinition>
./besclient -u csmith -p xxxxxxx create sleep.xml sleep-1.epr
BES Job Status
./besclient -u csmith -p xxxxxxx status sleep-1.epr
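When many BES jobs with different run times are needed, the JSDL document can be generated rather than edited by hand. The following is a minimal sketch using the Python standard library. It reproduces the element structure and the two namespaces of the sleep example above, but writes them with explicit prefixes (jsdl, hpcpa) instead of default namespaces. The two forms are equivalent XML, although it has not been verified here that besclient accepts the prefixed form. The function name is illustrative.

```python
import xml.etree.ElementTree as ET

JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
HPCPA_NS = "http://schemas.ggf.org/jsdl/2006/07/jsdl-hpcpa"

# Readable prefixes for serialisation (otherwise ElementTree emits ns0, ns1).
ET.register_namespace("jsdl", JSDL_NS)
ET.register_namespace("hpcpa", HPCPA_NS)


def build_sleep_jsdl(seconds: int, job_name: str = "Sleep",
                     project: str = "BES") -> str:
    """Return a JSDL document that runs `sleep <seconds>` on one CPU."""
    def j(tag):  # element name in the JSDL namespace
        return f"{{{JSDL_NS}}}{tag}"

    def h(tag):  # element name in the HPC Profile Application namespace
        return f"{{{HPCPA_NS}}}{tag}"

    root = ET.Element(j("JobDefinition"))
    desc = ET.SubElement(root, j("JobDescription"))

    ident = ET.SubElement(desc, j("JobIdentification"))
    ET.SubElement(ident, j("JobName")).text = job_name
    ET.SubElement(ident, j("JobProject")).text = project

    app = ET.SubElement(ET.SubElement(desc, j("Application")),
                        h("HPCProfileApplication"))
    ET.SubElement(app, h("Executable")).text = "sleep"
    ET.SubElement(app, h("Argument")).text = str(seconds)
    ET.SubElement(app, h("Output")).text = "/dev/null"
    ET.SubElement(app, h("WorkingDirectory")).text = "/tmp"

    cpus = ET.SubElement(ET.SubElement(desc, j("Resources")),
                         j("TotalCPUCount"))
    ET.SubElement(cpus, j("Exact")).text = "1"

    return ET.tostring(root, encoding="unicode")
```

Writing the result of `build_sleep_jsdl(300)` to sleep.xml gives a document with the same content as the example above.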
Cream
Installation
mount lxbra1908.cern.ch:/share /share
yum install expat log4cpp
export LD_LIBRARY_PATH=/share/glite/lib:/share/glite/globus/lib:/share/glite/external/opt/c-ares/lib/:/share/glite/external/opt/
cd /share/glite/bin
Job Submission
[
JobType = "Normal";
Executable = "/bin/date";
Arguments = "";
StdOutput="out.txt";
StdError="err.txt";
OutputSandbox = {"out.txt", "err.txt"};
OutputSandboxBaseDestUri = "gsiftp://lxbra1908.cern.ch/tmp/";
]
glite-ce-job-submit --autm-delegation --resource lxbra1908.cern.ch:8443/cream-lsf-normal example.jdl
Job Status
glite-ce-job-status https://lxbra1908.cern.ch:8443/CREAM830810312
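For repeated direct submissions it can help to give each job its own output file names, so that results copied to the same destination do not overwrite each other. The sketch below generates a JDL with the same attributes as the example above. The numbered file names (out-N.txt, err-N.txt) and the function name are assumptions made for illustration.

```python
def build_jdl(index: int, executable: str = "/bin/date", arguments: str = "",
              dest_uri: str = "gsiftp://lxbra1908.cern.ch/tmp/") -> str:
    """Return JDL text for one job, using per-job output file names."""
    out = f"out-{index}.txt"   # assumed naming convention
    err = f"err-{index}.txt"   # assumed naming convention
    lines = [
        "[",
        'JobType = "Normal";',
        f'Executable = "{executable}";',
        f'Arguments = "{arguments}";',
        f'StdOutput = "{out}";',
        f'StdError = "{err}";',
        f'OutputSandbox = {{"{out}", "{err}"}};',
        f'OutputSandboxBaseDestUri = "{dest_uri}";',
        "]",
    ]
    return "\n".join(lines) + "\n"
```

Each generated text can be saved to its own file and passed to glite-ce-job-submit in place of example.jdl.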
ARC
Installation
mount lxbra1908.cern.ch:/share /share
yum install libxml2 libtool-libs openldap
export LD_LIBRARY_PATH=/share/nordugrid/lib
export GLOBUS_LOCATION=/share/glite/globus/
cd /share/nordugrid/bin/
Job Submission
&
(executable=/bin/echo)
(arguments="Hello World" )
(stdout="hello.txt")
(stderr="hello.err")
(* Grid Manager auxiliary logs will be stored in this directory: *)
(gmlog="gridlog")
(jobname="My Hello Grid")
./ngsub -c bes.cern.ch -f arc-rsl
Job Status
./ngstat gsiftp://bes.cern.ch:2811/jobs/5131235049518197584921