SGE stress tests of lcg-CE
Overview
- These tests are based on work by Nikos Voutsinas, Dimitrios Apostolou and Kostantinos Koukopoulos (see references). We had to adapt some scripts and commands to work with the SGE batch system.
- Only one WN with a free slot was used in this testbed.
Test Scripts
We run these sensors on our nodes to collect information from SGE, such as memory usage, queued jobs, etc. The latest version of these scripts can be downloaded from:
https://twiki.cern.ch/twiki/bin/view/LCG/SGE_Stress#SGE_CVS
- Monitoring CE monitor_CEsge.sh
#!/bin/bash
VO=${1:-dteam}
USER=${2:-dteam091} # Second argument; change the default for different qsub users
DATE=`date -Iminutes`
OUTCE="IDs/${DATE}_CE.out"
OUTBDII="IDs/${DATE}_BDII.out"
TIME=`date -Iminutes`
LA=`sed -e 's/^\([^ ]*\).*\/\([^ ]*\).*/\1 \2/' /proc/loadavg`
SGE_QMASTER=`ps auxw|grep sge_qmaster|awk '{print $3" "$4}'|awk 'BEGIN{cpu=0;sz=0}{cpu+=$1;sz+=$2}END{print cpu" "sz}'`
SGE_SCHEDD=`ps auxw|grep sge_schedd|awk '{print $3" "$4}'|awk 'BEGIN{cpu=0;sz=0}{cpu+=$1;sz+=$2}END{print cpu" "sz}'`
BLPARSER=`ps auxw|grep globus-job-manager-script.pl|awk '{print $3" "$4}'|awk 'BEGIN{cpu=0;sz=0}{cpu+=$1;sz+=$2}END{print cpu" "sz}'`
JOBS=`qstat -q $VO|awk 'BEGIN {r=0;o=0}/'"$VO"'/{if ($5 == "r") r++ ;if ($5 ~ /[qweht]/) o++} END {print r" "o}'`
COMPLETED=`qstat -s z -u $USER | awk 'BEGIN { cont=0 }{ cont++; if ( cont > 3 ) print $1}' | wc -l`
[ $? -eq 0 ] || COMPLETED="0"
echo -e "$TIME\t$LA\t$BLPARSER\t$SGE_SCHEDD\t$SGE_QMASTER\t$JOBS\t$COMPLETED"
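The sensors print one tab-separated sample per invocation, so they have to be called periodically. A minimal collection loop (the log file name, the one-minute interval and the example arguments are our own choices, not part of the original scripts):
# Append one CE sample per minute; the resulting file can later be fed to the gnuplot scripts
while true; do
./monitor_CEsge.sh dteam dteam091 >> CE.monitor.out
sleep 60
done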
- Monitoring WN monitor_WNsge.sh
#!/bin/bash
SGE_EXECD=`ps --no-headers -C sge_execd -o %cpu,size|awk 'BEGIN{cpu=0;sz=0} {cpu+=\$1;sz+=\$2} END{print cpu" "sz}'`
TIME=`date -Iminutes`
LA=`sed -e 's/^\([^ ]*\).*\/\([^ ]*\).*/\1 \2/' /proc/loadavg`
echo -e "$TIME\t$LA\t$SGE_EXECD"
- Monitoring BDII on CE monitor_BDIIsge.sh
#!/bin/bash
VO="${1:-dteam}"
CE="${2:-sa3-ce.egee.cesga.es}"
DATE=`date -Iminutes`
OUTCE="IDs/${DATE}_CE.out"
OUTBDII="IDs/${DATE}_BDII.out"
GIP_QUERY="/opt/lcg/libexec/lcg-info-dynamic-sge 2>/dev/null"
BDII_PORT=2170
BDII_BASE="mds-vo-name=cesga-sa3,o=grid"
BDII_QUERY="(&(GlueForeignKey=GlueClusterUniqueID=$CE)(GlueCEName=$VO))"
BDII_ATTRS="GlueCEStateRunningJobs GlueCEStateWaitingJobs GlueCEStateTotalJobs"
BDII_CMD="ldapsearch -h $CE -p $BDII_PORT -x -b $BDII_BASE $BDII_QUERY $BDII_ATTRS"
PRINTJOBS='/RunningJobs/{r=$2}/TotalJobs/{t=$2}/WaitingJobs/{w=$2}END{print r" "t" "w}'
TIME=`date -Iminutes`
LA=`sed -e 's/^\([^ ]*\).*\/\([^ ]*\).*/\1 \2/' /proc/loadavg`
GIP=`/opt/lcg/libexec/lcg-info-dynamic-sge 2>/dev/null|sed -ne "/$VO/,/^$/p"| awk "$PRINTJOBS"`
LDAP=`$BDII_CMD |awk "$PRINTJOBS"`
[ $? -eq 0 ] || LDAP="0 0 0"
echo -e "$TIME\t$LA\t$GIP\t$LDAP"
Gnuplot Scripts
# Usage: wms-lrms.sh outwms outCE_wms graph.png [xtics]
WMS_INPUT=$1
LRMS_INPUT=$2
OUTPUT=$3
TICS=$4
(
cat<<EOF
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set y2tics border nomirror norotate
set style data lines
set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"
set ylabel "Jobs"
set title "WMS vs LRMS status report"
set terminal png size 600,300
EOF
[ -n "$OUTPUT" ] && echo "set output '$OUTPUT'"
echo "plot '$WMS_INPUT' using 1:9 title 'WMS Ready' with lines lw 3, \
'$WMS_INPUT' using 1:3 title 'WMS Done' with lines lw 3, \
'$WMS_INPUT' using 1:7 title 'WMS Running' with lines lw 3, \
'$WMS_INPUT' using 1:5 title 'WMS Scheduled' with lines lw 3, \
'$LRMS_INPUT' using 1:12 title 'LRMS Done' with lines lw 3"
)|gnuplot
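A hypothetical invocation of this wrapper (file names are placeholders; the optional fourth argument is passed to gnuplot as the xtics increment, in seconds):
./wms-lrms.sh jobs.status CE.monitor.out wms-lrms.png 3600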
# Usage: gip-bdii.sh outBDII graph.png [xtics]
INPUT=$1
OUTPUT=$2
TICS=$3
(
cat<<EOF
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set style data lines
set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"
set ylabel "Jobs"
set title "LRMS Queue Length"
set terminal png size 500,300
EOF
echo "set output '$OUTPUT'"
echo "plot '$INPUT' using 1:6 title 'Waiting (GIP)' with lines lw 3, \
'$INPUT' using 1:9 title 'Waiting (BDII)' with lines lw 3"
)|gnuplot
# Usage: loads.sh outCE outWN outCE graph.png [xtics]
INPUT1=$1
INPUT2=$2
INPUT3=$3
OUTPUT=$4
TICS=$5
(
cat<<EOF
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set style data lines
set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"
set ylabel "1 min load average"
set title "Average Loads"
set terminal png size 500,300
EOF
echo "set output '$OUTPUT'"
echo "plot '$INPUT1' using 1:2 title 'lcgCE' with lines lw 3, \
'$INPUT2' using 1:2 title 'WN_SGE' with lines lw 3, \
'$INPUT3' using 1:8 title 'SGE_server' with lines lw 3"
)|gnuplot
# Usage: sgeqmaster-mem.sh outCE graph.png [xtics]
INPUT=$1
OUTPUT=$2
TICS=$3
(
cat<<EOF
set mouse
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set y2tics border nomirror norotate
set style data lines
set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"
set ylabel "Size (x1000k)"
set y2label "Queued Jobs"
set title "SGE Qmaster Process Size"
set terminal png size 600,300
EOF
echo "set output '$OUTPUT'"
echo "plot '$INPUT' using 1:11 axes x1y2 title 'Jobs' with lines lw 3, \
'$INPUT' using 1:9 title 'SGE Qmaster Footprint' with lines lw 3"
)|gnuplot
# Usage: sgeschedd-mem.sh outCE graph.png [xtics]
INPUT=$1
OUTPUT=$2
TICS=$3
(
cat<<EOF
set mouse
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set y2tics border nomirror norotate
set style data lines
set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"
set ylabel "Size (x1000k)"
set y2label "Queued Jobs"
set title "SGE Schedd Process Size"
set terminal png size 600,300
EOF
echo "set output '$OUTPUT'"
echo "plot '$INPUT' using 1:11 axes x1y2 title 'Jobs' with lines lw 3, \
'$INPUT' using 1:7 title 'SGE Schedd Footprint' with lines lw 3"
)|gnuplot
Job Submitting
- Script job_sub.sh. This script submits NUM jobs through a WMS, creating an IDs directory to store the job IDs and logs.
#!/bin/bash
DATE=`date -Iminutes`
SITE=${1:-cern}
NUM=${2:-20}
JOB=${3:-sa3-200.jdl}
WMS=$4
IDS="IDs/$DATE"_"$SITE"
OUT="IDs/$DATE"_"$SITE"_"$NUM".out
STATUS="IDs/$DATE"_"$SITE"_"$NUM".status
mkdir -p IDs
if [ $SITE == "uoa" ];then
WMS="https://ctb06.gridctb.uoa.gr:7443/glite_wms_wmproxy_server"
elif [ $SITE == "cern" ];then
#WMS="https://lxb2032.cern.ch:7443/glite_wms_wmproxy_server"
WMS="https://wms01.egee.cesga.es:7443/glite_wms_wmproxy_server"
elif [ $SITE == "cy" ];then
WMS="https://wmslb201.grid.ucy.ac.cy:7443/glite_wms_wmproxy_server"
fi
DELAY=1
echo "$DATE"> $OUT
echo "$DATE"> $STATUS
time for i in `seq 1 $NUM`; do
glite-wms-job-submit -a -o $IDS -c /opt/cesga/lcg-sa3/glite/etc/dteam/glite_wms.conf -e $WMS $JOB
sleep $DELAY
done>>$OUT 2>&1 &
while true
do
SEC=`date +%S`
if [ $SEC -ge "59" -o $SEC -le "1" ];then
JOBS="`glite-wms-job-status -c /opt/cesga/lcg-sa3/glite/etc/dteam/glite_wms.conf --noint --logfile /dev/null -i $IDS |awk '
BEGIN {a["Done"] =a["Ready"] =a["Running"]=a["Scheduled"]=0;
a["Undefined"]=a["Submitted"]=a["Waiting"]=a["Cleared"] =0;
a["Aborted"] =a["Cancelled"]=a["Purged"] =a["Unknown"] =0;}
/Current Status:[ ]*([^ ]*)/ {a[$3]+=1;total+=1}
END {if (total==0) exit 1;
for (i in a){
printf("%s=%s\t",i,a[i]);}
printf("\n");}'`"
if [ $? -eq 0 ];then
CURRENT_DATE=`date -Iminutes`
echo -e "$CURRENT_DATE\t$JOBS" >> $STATUS
WAIT_JOBS=`echo "$JOBS" |egrep -qi "(SUBMITTED=[^0]|WAITING=[^0]|READY=[^0]|SCHEDULED=[^0]|RUNNING=[^0])";echo $?`
if [ "$WAIT_JOBS" -ne 0 ];then
break;
fi
OUT="IDs/$DATE"_"$SITE"_"$NUM".out
STATUS="IDs/$DATE"_"$SITE"_"$NUM".status
if [ $SITE == "uoa" ];then
WMS="https://ctb06.gridctb.uoa.gr:7443/glite_wms_wmproxy_server"
elif [ $SITE == "cern" ];then
#WMS="https://lxb2032.cern.ch:7443/glite_wms_wmproxy_server"
WMS="https://wms01.egee.cesga.es:7443/glite_wms_wmproxy_server"
elif [ $SITE == "cy" ];then
WMS="https://wmslb201.grid.ucy.ac.cy:7443/glite_wms_wmproxy_server"
fi
DELAY=1
echo "$DATE"> $OUT
echo "$DATE"> $STATUS
time for i in `seq 1 $NUM`; do
glite-wms-job-submit -a -o $IDS -c /opt/cesga/lcg-sa3/glite/etc/dteam/glite_wms.conf -e $WMS $JOB
sleep $DELAY
done>>$OUT 2>&1 &
while true
do
SEC=`date +%S`
if [ $SEC -ge "59" -o $SEC -le "1" ];then
JOBS="`glite-wms-job-status -c /opt/cesga/lcg-sa3/glite/etc/dteam/glite_wms.conf --noint --logfile /dev/null -i $IDS |awk '
BEGIN {a["Done"] =a["Ready"] =a["Running"]=a["Scheduled"]=0;
a["Undefined"]=a["Submitted"]=a["Waiting"]=a["Cleared"] =0;
a["Aborted"] =a["Cancelled"]=a["Purged"] =a["Unknown"] =0;}
/Current Status:[ ]*([^ ]*)/ {a[$3]+=1;total+=1}
END {if (total==0) exit 1;
for (i in a){
printf("%s=%s\t",i,a[i]);}
printf("\n");}'`"
if [ $? -eq 0 ];then
CURRENT_DATE=`date -Iminutes`
echo -e "$CURRENT_DATE\t$JOBS" >> $STATUS
WAIT_JOBS=`echo "$JOBS" |egrep -qi "(SUBMITTED=[^0]|WAITING=[^0]|READY=[^0]|SCHEDULED=[^0]|RUNNING=[^0])";echo $?`
if [ "$WAIT_JOBS" -ne 0 ];then
break;
fi
sleep 10
fi
fi
sleep 2
done
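A hypothetical invocation matching the defaults above, submitting 200 sa3-200.jdl jobs through the CESGA WMS:
./job_sub.sh cern 200 sa3-200.jdl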
- Script qsub_test.sh. This script submits $1 jobs directly to SGE using qsub as a local user.
#!/bin/sh
# --> All jobs run in high priority to execute before others, since they are
# simple but important. High priority is represented by high value!
USER="esfreire"
GROUP="cesga"
QUEUE="cesga"
HOWMANY="$1"
TOTAL=`echo $HOWMANY - 1 | bc` # With e.g. 200 requested, qsub only executes 199, so we compare against HOWMANY-1
JOB='cd $HOME;date;hostname;whoami;pwd';
if [ -z $HOWMANY ];then
exit 2
fi
trap "exit 2" 1 2 3 15
# Create a unique temporary directory, else exit with error
TMPDIR=`mktemp -d $HOME/qsub_test-XXXXXXXX` || {
echo Could not create temporary directory in /home/esfreire/ ; exit 1; }
# Clean up the TMPDIR always before leaving the script
trap "rm -rf $TMPDIR" 0
JOBIDS="$TMPDIR/jobids"
rm -f $JOBIDS
echo "START TIME: `date -Iminutes`"
# Then check for successful execution of all HOWMANY jobs
echo "Submitting $HOWMANY simple jobs..."
for JOBNO in `seq 1 $HOWMANY`
do
JOBDIR="$TMPDIR/$JOBNO"
mkdir $JOBDIR
cd $JOBDIR
echo $JOB | qsub -p 1024 >> $JOBIDS
done
# Wait for all jobs to complete, or to be removed from the queue somehow.
# WARNING: this could wait too long, so this script should be supervised,
# either by the user or by another script
echo "Waiting for all jobs to end..."
COMPLETED=`qstat -s z -u $USER | awk 'BEGIN { cont=0 }{ cont++; if ( cont > 3 ) print $1}' | wc -l`
while [ $COMPLETED != $TOTAL ]
do
sleep 30
COMPLETED=`qstat -s z -u $USER | awk 'BEGIN { cont=0 }{ cont++; if ( cont > 3 ) print $1}' | wc -l`
echo "$COMPLETED jobs completed..."
done
echo "DONE!"
# Now that we know the jobs are done, check stdout and stderr of all jobs
echo Checking successful execution...
echo A simple dot means job success, an exclamation mark is failure!
FAIL=no
for JOBNO in `seq 1 $HOWMANY`
do
JOBOUT="$TMPDIR/$JOBNO/STDIN.o*"
JOBERR="$TMPDIR/$JOBNO/STDIN.e*"
if [ `cat $JOBERR 2>/dev/null | wc -l` != 0 -o `cat $JOBOUT 2>/dev/null | wc -l` == 0 ]
then
echo -n !
FAIL=yes
else
echo -n .
fi
done
echo
echo "END TIME: `date -Iminutes`"
[ $FAIL == yes ] && exit 1
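A hypothetical invocation submitting 200 simple jobs (USER, GROUP and QUEUE are hard-coded at the top of the script and must be adapted to the site):
./qsub_test.sh 200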
- Script qsub1.sh. This is a stress script that submits a large number of jobs and then deletes them immediately.
#!/bin/bash
NUM_OF_ITERATIONS=${1:-2};
NUM_OF_JOBS=${2:-10};
SLEEP_TIME=${3:-604800};
USER="cesga025";
QUEUE="cesga"
FILE_OUT=`mktemp $HOME/qsub_stress_out_XXXXXXXX`
FILE_ERR=`mktemp $HOME/qsub_stress_err_XXXXXXXX`
exec 1<>$FILE_OUT
exec 2<>$FILE_ERR
submit_job () {
BS_JOB_ID="";
#BS_JOB_ID=`su - $USER -c "echo sleep $SLEEP_TIME | qsub -q $QUEUE -o $TMPDIR/tmpout -e $TMPDIR/tmperr"`
BS_JOB_ID=`echo sleep $SLEEP_TIME | qsub -q $QUEUE -o $HOME/stdout -e $HOME/stderr | cut -d" " -f3`
}
jobs_iteration () {
for JOB_NUM in `seq 1 $NUM_OF_JOBS`;do
echo -e "0\t$INTER_NUM\t$JOB_NUM"
submit_job && JOBS_ARRAY[JOB_NUM]=$BS_JOB_ID;
done
for JOB_NUM in `seq 1 $NUM_OF_JOBS`;do
echo -e "1\t$INTER_NUM\t$JOB_NUM\t${JOBS_ARRAY[JOB_NUM]}"
qdel ${JOBS_ARRAY[JOB_NUM]}
done
}
for ITER_NUM in `seq 1 $NUM_OF_ITERATIONS`;do
jobs_iteration &
# wait an hour, then run the next iteration with 1000 more jobs
sleep 3600
NUM_OF_JOBS=`echo $NUM_OF_JOBS + 1000 | bc`
done
wait
mv $FILE_OUT qsub_stress_`date -Iminutes`_"$NUM_OF_ITERATIONS"_"$NUM_OF_JOBS".out
mv $FILE_ERR qsub_stress_`date -Iminutes`_"$NUM_OF_ITERATIONS"_"$NUM_OF_JOBS".err
exit 0;
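A hypothetical invocation reproducing the pattern of the stress tests described below: 7 iterations, starting at 1000 jobs and adding 1000 more per iteration, with week-long sleep jobs that the script deletes itself:
./qsub1.sh 7 1000 604800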
Performed Tests
BDII
- Check Batch System information published through BDII
>ldapsearch -h sa3-ce.egee.cesga.es -p 2170 -x -b mds-vo-name=cesga-sa3,o=grid "(&(GlueForeignKey=GlueClusterUniqueID=sa3-ce.egee.cesga.es)""(GlueCEName=dteam))" GlueCEInfoHostName GlueCEInfoLRMSType GlueCEInfoLRMSVersion GlueCEInfoJobManager
#
# filter: (&(GlueForeignKey=GlueClusterUniqueID=sa3-ce.egee.cesga.es)(GlueCEName=dteam))
# requesting: GlueCEInfoHostName GlueCEInfoLRMSType GlueCEInfoLRMSVersion GlueCEInfoJobManager
#
# sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam, CESGA-SA3, grid
dn: GlueCEUniqueID=sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam,mds-vo-na
me=CESGA-SA3,o=grid
GlueCEInfoHostName: sa3-ce.egee.cesga.es
GlueCEInfoLRMSType: sge
GlueCEInfoLRMSVersion: 6.0u7
GlueCEInfoJobManager: lcgsge
# search result
search: 2
result: 0 Success
# numResponses: 2
# numEntries: 1
QUEUES
- Check Batch System configuration
>qconf -sq '*'
qname biomed
hostlist sa3-ce.egee.cesga.es sa3-wn001.egee.cesga.es
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 19
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 1
tmpdir /tmp
shell /bin/csh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists biomed
xuser_lists NONE
subordinate_list NONE
...
>qhost
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
-------------------------------------------------------------------------------
global - - - - - - -
sa3-ce - - - - - - -
sa3-wn001 lx26-x86 1 0.00 248.2M 19.1M 509.8M 3.9M
>ls /usr/local/sge/V60u7_1/default/spool/qmaster/exec_hosts/
global sa3-ce.egee.cesga.es sa3-wn001.egee.cesga.es template
- View the global cluster and execution host configuration
>qconf -sconf
global:
execd_spool_dir /opt/cesga/sge60/default/spool
mailer /bin/mail
xterm /usr/bin/X11/xterm
load_sensor none
prolog none
epilog none
shell_start_mode unix_behavior
login_shells sh,ksh,csh,tcsh,bash
min_uid 0
min_gid 0
user_lists none
xuser_lists none
projects none
xprojects none
enforce_project false
....
>qconf -se sa3-wn001.egee.cesga.es
hostname sa3-wn001.egee.cesga.es
load_scaling NONE
complex_values NONE
load_values arch=lx26-x86,num_proc=1,mem_total=248.183594M, \
swap_total=509.835938M,virtual_total=758.019531M, \
load_avg=0.000000,load_short=0.000000, \
load_medium=0.000000,load_long=0.000000, \
mem_free=229.195312M,swap_free=505.984375M, \
virtual_free=735.179688M,mem_used=18.988281M, \
swap_used=3.851562M,virtual_used=22.839844M, \
cpu=0.000000,np_load_avg=0.000000, \
np_load_short=0.000000,np_load_medium=0.000000, \
np_load_long=0.000000
processors 1
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
NETWORK
- Network connections on the WN:
>netstat -apt
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 *:sunrpc *:* LISTEN 687/portmap
tcp 0 0 *:32786 *:* LISTEN 706/rpc.statd
tcp 0 0 *:8659 *:* LISTEN 5175/gmond
tcp 0 0 xen.egee.cesga.es:32788 *:* LISTEN 876/xinetd
tcp 0 0 *:ssh *:* LISTEN 862/sshd
tcp 0 0 xen.egee.cesga.es:smtp *:* LISTEN 918/sendmail: accep
tcp 0 0 *:sge_execd *:* LISTEN 854/sge_execd
tcp 0 144 sa3-wn001.egee.cesga.es:ssh paco.murcia.cesga.es:37513 ESTABLISHED 9006/0
tcp 0 0 sa3-wn001.egee.cesga.:35930 sa3-ce.egee.ces:sge_qmaster ESTABLISHED 854/sge_execd
- Network connections on the CE:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 *:globus-gatekeeper *:* LISTEN 7963/edg-gatekeeper
tcp 0 0 *:9002 *:* LISTEN 9861/edg-wl-logd
tcp 0 0 *:mysql *:* LISTEN 3917/mysqld
tcp 0 0 *:8659 *:* LISTEN 4182/gmond
tcp 0 0 sa3-ce.egee.cesga.es:2135 *:* LISTEN 6325/slapd
tcp 0 0 *:sge_qmaster *:* LISTEN 10030/sge_qmaster
tcp 0 0 localhost.localdomain:smtp *:* LISTEN 792/sendmail: accep
tcp 0 0 *:2170 *:* LISTEN 28795/bdii-fwd [acc
tcp 0 0 localhost.localdomain:2171 *:* LISTEN 10154/slapd
tcp 0 0 *:2811 *:* LISTEN 9542/ftpd: acceptin
tcp 0 0 localhost.localdomain:2172 *:* LISTEN 10687/slapd
tcp 0 0 localhost.localdomain:2173 *:* LISTEN 9666/slapd
tcp 0 0 sa3-ce.egee.ces:sge_qmaster sa3-wn001.egee.cesga.:35930 ESTABLISHED 10030/sge_qmaster
tcp 0 0 sa3-ce.egee.cesga.es:58991 sa3-mon.egee.cesga.es:12409 ESTABLISHED 28644/edg-fmon-agen
tcp 0 0 sa3-ce.egee.cesga.es:2135 sa3-ce.egee.cesga.es:40588 TIME_WAIT -
tcp 0 0 sa3-ce.egee.cesga.es:56406 sa3-ce.egee.ces:sge_qmaster TIME_WAIT -
tcp 0 0 sa3-ce.egee.cesga.es:56405 sa3-ce.egee.ces:sge_qmaster TIME_WAIT -
tcp 0 0 sa3-ce.egee.cesga.es:8659 www2.egee.cesga.es:47588 TIME_WAIT -
tcp 0 0 sa3-ce.egee.ces:sge_qmaster sa3-ce.egee.cesga.es:51785 ESTABLISHED 10030/sge_qmaster
tcp 0 0 localhost.localdomain:2171 localhost.localdomain:60003 TIME_WAIT -
tcp 0 0 sa3-ce.egee.cesga.es:8659 www2.egee.cesga.es:47596 TIME_WAIT -
tcp 0 0 localhost.localdomain:2172 localhost.localdomain:51367 TIME_WAIT -
tcp 0 0 localhost.localdomain:60003 localhost.localdomain:2171 TIME_WAIT -
tcp 0 0 sa3-ce.egee.cesga.es:51785 sa3-ce.egee.ces:sge_qmaster ESTABLISHED 10063/sge_schedd
tcp 0 0 sa3-ce.egee.cesga.es:8659 www2.egee.cesga.es:47603 TIME_WAIT -
tcp 0 0 localhost.localdomain:51366 localhost.localdomain:2172 TIME_WAIT -
tcp 0 0 sa3-ce.egee.cesga.es:8659 www2.egee.cesga.es:47572 TIME_WAIT -
tcp 0 0 sa3-ce.egee.cesga.es:45480 sa3-mon.egee.cesga.es:2135 TIME_WAIT -
tcp 0 0 sa3-ce.egee.cesga.es:8659 www2.egee.cesga.es:47580 TIME_WAIT -
tcp 0 0 *:ssh *:* LISTEN 29087/sshd
tcp 0 2304 sa3-ce.egee.cesga.es:ssh paco.murcia.cesga.es:57729 ESTABLISHED 6591/0
>cat /usr/local/sge/V60u7_1/default/spool/qmaster/messages
03/29/2007 18:05:33|qmaster|sa3-ce|I|read job database with 0 entries in 0 seconds
03/29/2007 18:05:33|qmaster|sa3-ce|I|qmaster hard descriptor limit is set to 1024
03/29/2007 18:05:33|qmaster|sa3-ce|I|qmaster soft descriptor limit is set to 1024
03/29/2007 18:05:33|qmaster|sa3-ce|I|qmaster will use max. 1004 file descriptors for communication
03/29/2007 18:05:33|qmaster|sa3-ce|I|qmaster will accept max. 99 dynamic event clients
03/29/2007 18:05:33|qmaster|sa3-ce|I|starting up 6.0u7
...
JOBS
>vi sa3.jdl
JobType = "Normal";
ShallowRetryCount = 0;
RetryCount = 0;
Executable = "sa3.sh";
Arguments = "180 1";
StdOutput = "sa3.out";
StdError = "sa3.err";
InputSandbox = {"sa3.sh"};
OutputSandbox = {"sa3.out","sa3.err"};
#Rank = other.GlueCEStateFreeCPUs;
Requirements = other.GlueCEUniqueID == "sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam"
>vi sa3.sh
#!/bin/bash
SLEEP_TIME=${1:-0};
DO_MD5=${2:-0};
FILE_MD5=${3:-/dev/urandom};
NUM_OF_BG_JOBS=2
date
hostname
whoami
pwd
if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then
for i in `seq 1 $NUM_OF_BG_JOBS`;do
md5sum $FILE_MD5 &
job_id=$!
jobs_array[i]=$job_id
done
fi
if [ "$SLEEP_TIME" != "0" ];then
sleep $SLEEP_TIME
fi
if [ "$DO_MD5" != "0" -a -f $FILE_MD5 ];then
for i in `seq 1 $NUM_OF_BG_JOBS`;do
kill ${jobs_array[i]}
done
fi
exit 0;
>for i in `seq 1 10`; do edg-job-submit --vo dteam -o test.jid sa3.jdl;done
[root@sa3-ce libexec]# qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
631 0.00000 STDIN dteam001 qw 04/30/2007 18:22:29 1
632 0.00000 STDIN dteam001 qw 04/30/2007 18:22:33 1
- Scheduler Info Provider output on the site BDII:
# sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam, CESGA-SA3, grid
dn: GlueCEUniqueID=sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam,mds-vo-na
me=CESGA-SA3,o=grid
objectClass: GlueCETop
objectClass: GlueCE
objectClass: GlueSchemaVersion
objectClass: GlueCEAccessControlBase
objectClass: GlueCEInfo
objectClass: GlueCEPolicy
objectClass: GlueCEState
objectClass: GlueInformationService
objectClass: GlueKey
GlueCEHostingCluster: sa3-ce.egee.cesga.es
GlueCEName: dteam
GlueCEUniqueID: sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam
GlueCEInfoGatekeeperPort: 2119
GlueCEInfoHostName: sa3-ce.egee.cesga.es
GlueCEInfoLRMSType: sge
GlueCEInfoLRMSVersion: 6.0u7
GlueCEInfoTotalCPUs: 2
GlueCEInfoJobManager: lcgsge
GlueCEInfoContactString: sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam
GlueCEInfoApplicationDir: /opt/exp_soft
GlueCEInfoDataDir: unset
GlueCEInfoDefaultSE: sa3-se.egee.cesga.es
GlueCEStateEstimatedResponseTime: 2159906
GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 1
GlueCEStateStatus: Production
GlueCEStateTotalJobs: 8
GlueCEStateWaitingJobs: 7
GlueCEStateWorstResponseTime: 4319813
GlueCEStateFreeJobSlots: 0
GlueCEPolicyMaxCPUTime: 4320
GlueCEPolicyMaxRunningJobs: 2
GlueCEPolicyMaxTotalJobs: 0
GlueCEPolicyMaxWallClockTime: 9000
GlueCEPolicyPriority: 1
GlueCEPolicyAssignedJobSlots: 0
GlueCEAccessControlBaseRule: VO:dteam
GlueForeignKey: GlueClusterUniqueID=sa3-ce.egee.cesga.es
GlueInformationServiceURL: ldap://sa3-ce.egee.cesga.es:2135/mds-vo-name=local,
o=grid
GlueSchemaVersionMajor: 1
GlueSchemaVersionMinor: 2
- Scheduler Info Provider with heavy LRMS load:
# sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam, CESGA-SA3, grid
dn: GlueCEUniqueID=sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam,mds-vo-na
me=CESGA-SA3,o=grid
objectClass: GlueCETop
objectClass: GlueCE
objectClass: GlueSchemaVersion
objectClass: GlueCEAccessControlBase
objectClass: GlueCEInfo
objectClass: GlueCEPolicy
objectClass: GlueCEState
objectClass: GlueInformationService
objectClass: GlueKey
GlueCEHostingCluster: sa3-ce.egee.cesga.es
GlueCEName: dteam
GlueCEUniqueID: sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam
GlueCEInfoGatekeeperPort: 2119
GlueCEInfoHostName: sa3-ce.egee.cesga.es
GlueCEInfoLRMSType: sge
GlueCEInfoLRMSVersion: 6.0u7
GlueCEInfoTotalCPUs: 2
GlueCEInfoJobManager: lcgsge
GlueCEInfoContactString: sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam
GlueCEInfoApplicationDir: /opt/exp_soft
GlueCEInfoDataDir: unset
GlueCEInfoDefaultSE: sa3-se.egee.cesga.es
GlueCEStateEstimatedResponseTime: 12689927
GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 1
GlueCEStateStatus: Production
GlueCEStateTotalJobs: 47
GlueCEStateWaitingJobs: 46
GlueCEStateWorstResponseTime: 25379855
GlueCEStateFreeJobSlots: 0
GlueCEPolicyMaxCPUTime: 4320
GlueCEPolicyMaxRunningJobs: 2
GlueCEPolicyMaxTotalJobs: 0
GlueCEPolicyMaxWallClockTime: 9000
GlueCEPolicyPriority: 1
GlueCEPolicyAssignedJobSlots: 0
GlueCEAccessControlBaseRule: VO:dteam
GlueForeignKey: GlueClusterUniqueID=sa3-ce.egee.cesga.es
GlueInformationServiceURL: ldap://sa3-ce.egee.cesga.es:2135/mds-vo-name=local,
o=grid
GlueSchemaVersionMajor: 1
GlueSchemaVersionMinor: 2
ACCOUNTING
>qacct -b 0705011000 -o dteam001
OWNER WALLCLOCK UTIME STIME CPU MEMORY IO IOW
=========================================================================================================================
dteam001 881 27 24 738 2.527 0.000 0.000
- Number of jobs run by the VO:
[esfreire@sa3-ce esfreire]$ qacct -j -g dteam | grep jobnumber | wc -l
4488
BATCH SYSTEM
Description:
About 4 simple jobs with a sleep of 180 s are submitted directly to the LRMS. While the jobs are running, the SGE daemon on the Worker Node is shut down using the SGE command-line interface and later brought back online. In addition, the resilience of the LRMS is checked by shutting down sge_qmaster and sge_schedd and starting them again after a while.
Comments:
If we stop the qmaster and the scheduler on the CE (/etc/init.d/sgemaster stop), the jobs are no longer reported as running, but when we restart the services they are reported as running again and finish without problems. However, if we stop sge_execd on the WN, the job is killed when the daemon is restarted.
In SGE, one way to set a WN offline is to configure num_proc as a complex on the WN and set num_proc=0 when we do not want jobs to be dispatched to that node. We have not tested this on the current testbed, but we are using it on other CESGA clusters.
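A minimal sketch of that approach (untested on this testbed; the host name is taken from the examples above and num_proc is assumed to be defined as a requestable/consumable complex via qconf -mc):
# Pin num_proc to 0 on the execution host so no new jobs are dispatched to it
>qconf -me sa3-wn001.egee.cesga.es
...
complex_values num_proc=0
...
# Restoring the real value (num_proc=1 on this WN) puts the node back online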
Results
Jobs Status
- Sending 200 jobs from the WMS and from an RB.
We can observe that job submission through the WMS is faster than through the RB, and that the WMS reports jobs as Done slightly later (only a few minutes) than the LRMS.
- Sending 200 jobs using qsub.
From the LRMS queue length we can see that the SGE GIP response is fast, with at worst a one-minute delay with respect to the BDII. The average load on the SGE WN and on the SGE server is very low.
The sge_schedd memory footprint shrinks very quickly, while sge_qmaster behaves differently, showing some peaks during job execution.
- Sending 1000 jobs using qsub.
The GIP response remains constant, as in the previous case. The SGE WN is more loaded than the CE (the CE load average is the same as with 200 jobs).
With 1000 jobs the sge_schedd memory was freed within 2 hours, whereas the sge_qmaster footprint remained constant.
Stress Tests
- Stressing the LRMS memory management.
We submitted 1000, 2000 ... 7000 jobs and deleted them at different time intervals. With a large number of jobs sge_schedd shows some peaks in used memory, but usage drops again within about an hour and is constant after 12 hours. After several hours we submitted 7000 jobs and observed another memory peak followed by a fast reduction in memory usage (within about 2 hours).
sge_qmaster uses more memory than the scheduler; the maximum usage was a peak of 500 MB. After the peaks the memory usage stayed constant, decreasing only slowly.
- Behavior of the LRMS memory management after the stress tests, over a long time period.
After 24 h under normal load the memory usage returns to its initial state.
Notes
About SGE queue
To carry out the tests we had to change the default configuration of SGE. Specifically, we changed the scheduler configuration (qconf -msconf) and the global configuration (qconf -mconf) so that SGE accepts 7000 jobs and qstat -s z also displays 7000 jobs.
- For successive tests the same operation is repeated, i.e. to send 2000, 4000, ... jobs.
qconf -msconf
...
maxujobs 7000
max_functional_jobs_to_schedule 7000
...
qconf -mconf
....
finished_jobs 7000
max_u_jobs 7000
About SGE job epilog
- Configuring an epilog script to transfer files back to the submit host
In SGE, when a job starts executing it generates two files:
- the stdout file, <jobname>.o<JobID>: the output of the job is redirected to this file
- the stderr file, <jobname>.e<JobID>: any errors produced by the job are redirected to this file
On the other hand, in the queue configuration and in the "Global Cluster Configuration" there are two parameters that can be set:
- Prolog: a prolog script that is executed before the job starts.
- Epilog: an epilog script that is executed when the job finishes.
A typical LCG/gLite installation does not use shared home directories. This setup, however, is quite unusual at most SGE sites, as it makes supporting MPI harder. We will consider the case where the $HOME directory is not NFS-mounted or shared through any other network file system.
One major technical problem is how to handle job input/output and errors. In the NFS-based case this is straightforward: job-related files are transparently available on the submission node and the worker node via NFS. However, in the non-NFS case, any files created by the job on the Worker Node need to be copied back to the submission node. The most important files we deal with are <jobname>.o<JobID> and <jobname>.e<JobID>; these files are generated when the job starts executing on the WN, so no prolog script is necessary.
This file copying is implemented through a custom job epilog script, "epilog.sh". As epilog.sh is executed on the WN, it is important that this script is copied to all Worker Nodes and that appropriate access modes are set (use chmod 755). We copied it to the path:
/usr/local/sge/pro/default/common/epilog.sh
To configure a queue with an epilog script we should:
a) List the existing queues with the command 'qconf -sql'
b) Edit the queue's settings with the command 'qconf -mq queue_name'
[esfreire@sa3-ce esfreire]$ qconf -mq cesga
qname cesga
....
epilog /usr/local/sge/pro/default/common/epilog.sh
....
Note: to avoid errors generated by the epilog, it is necessary to set StrictHostKeyChecking no in the file /etc/ssh/ssh_config on each Execution Host, otherwise the epilog fails when it is asked for host key verification.
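A minimal sketch of the corresponding ssh_config fragment (assuming a host-wide setting on each Execution Host is acceptable):
# /etc/ssh/ssh_config on each Execution Host
Host *
  StrictHostKeyChecking no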
[root@sa3-wn001 esfreire]# cat /usr/local/sge/pro/default/common/epilog.sh
#!/bin/sh
# Epilog script to transfer stdout and stderr files from the execution host to the submit host.
# This is useful when the home of the users is not shared using NFS.
# by Cesga-Sa3
#########################################################################
# Variables
user=$SGE_O_LOGNAME # Login of the user
machine=$SGE_O_HOST # Machine submit host
stderr=$SGE_STDERR_PATH # File stderr
stdout=$SGE_STDOUT_PATH # File stdout
dest=$SGE_O_WORKDIR # Directory from which the qsub was submitted
dir=`pwd` # Directory on the execution host where the job is executed
#########################################################################
# Copy stderr and stdout files back to qmaster
if [ -r $stdout ]; then
scp $stdout $user@$machine:$dest
fi
if [ -r $stderr ]; then
scp $stderr $user@$machine:$dest
fi
# Delete stderr and stdout files if nobody changed their write permissions
if [ -w $stdout ]; then
rm $stdout
fi
if [ -w $stderr ]; then
rm $stderr
fi
FINISTERRAE STRESS TESTS
Overview
As on the SA3 site, we ran stress tests on the new Finisterrae supercomputer located at CESGA to check the behavior of SGE in a large cluster:
- SGE 6.1u3 Batch System
- 142 Itanium nodes
Stress Tests
We basically use the same scripts as in the SA3 tests. In this case we check only the SGE batch system and the system load: the monitoring scripts run on each node, and we then submit different job arrays from the SGE server.
- Sending from 1000 to 7000 jobs from the submit host using qsub job arrays.
The load average for sge_qmaster on the submit host and for sge_execd on all nodes is constant for the whole test.
As on SA3, the sge_schedd memory footprint decreases very quickly, while sge_qmaster behaves differently, with some peaks during job execution.
- Sending jobs without qsub job arrays.
In this case we submit the jobs with a for loop; memory consumption is higher than in the job-array test.
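A minimal sketch of the two submission modes compared above (the payload and the task count are illustrative only):
# One qsub call creates a 1000-task array job, tracked by SGE as a single job object
echo "sleep 60" | qsub -t 1-1000
# The same number of tasks submitted individually, one job object each (heavier on sge_qmaster)
for i in `seq 1 1000`; do echo "sleep 60" | qsub; done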
CVS: http://jra1mw.cvs.cern.ch:8180/cgi-bin/jra1mw.cgi/org.glite.testsuites.ctb/SGE/
References
Torque/Maui certification Plan and Results: http://master.gridctb.uoa.gr/torque-report/Torque-Maui_testplan.html
Sun Grid Engine Home Page: http://gridengine.sunsource.net/