SGE stress tests of lcg-CE

Overview

  • These tests are based on work by Nikos Voutsinas, Dimitrios Apostolou and Kostantinos Koukopoulos (see References). We had to adapt some of their scripts and commands for use with the SGE batch system.

  • Only one WN with a single free slot was used in this testbed.

Test Scripts

We run these sensor scripts on our nodes to collect information from SGE, such as memory usage, number of queued jobs, etc. The latest version of these scripts can be downloaded from: https://twiki.cern.ch/twiki/bin/view/LCG/SGE_Stress#SGE_CVS

  • Monitoring the CE: monitor_CEsge.sh

#!/bin/bash

VO=${1:-dteam}
USER=${2:-dteam091}      #Change for different qsub users
DATE=`date -Iminutes`
OUTCE="IDs/$DATE"_".CE.out
OUTBDII="IDs/$DATE"_".BDII.out

TIME=`date -Iminutes`
LA=`sed -e 's/^\([^ ]*\).*\/\([^ ]*\).*/\1 \2/' /proc/loadavg`


SGE_QMASTER=`ps auxw|grep sge_qmaster|grep -v grep|awk '{print $3" "$4}'|awk 'BEGIN{cpu=0;sz=0}{cpu+=$1;sz+=$2}END{print cpu" "sz}'`
SGE_SCHEDD=`ps auxw|grep sge_schedd|grep -v grep|awk '{print $3" "$4}'|awk 'BEGIN{cpu=0;sz=0}{cpu+=$1;sz+=$2}END{print cpu" "sz}'`
BLPARSER=`ps auxw|grep globus-job-manager-script.pl|grep -v grep|awk '{print $3" "$4}'|awk 'BEGIN{cpu=0;sz=0}{cpu+=$1;sz+=$2}END{print cpu" "sz}'`


JOBS=`qstat -q $VO|awk 'BEGIN {r=0;o=0}/'"$VO"'/{if ($5 == "r") r++ ;if ($5 ~ /[qweht]/) o++} END {print r" "o}'`
COMPLETED=`qstat -s z -u $USER | awk 'BEGIN { cont=0 }{ cont++; if ( cont > 3 ) print $1}' | wc -l`

[ $? -eq 0 ] || COMPLETED="0"

echo -e "$TIME\t$LA\t$BLPARSER\t$SGE_SCHEDD\t$SGE_QMASTER\t$JOBS\t$COMPLETED" 

  • Monitoring the WN: monitor_WNsge.sh

#!/bin/bash

SGE_EXECD=`ps --no-headers -C sge_execd -o %cpu,size|awk 'BEGIN{cpu=0;sz=0} {cpu+=\$1;sz+=\$2} END{print cpu" "sz}'`
TIME=`date -Iminutes`
LA=`sed -e 's/^\([^ ]*\).*\/\([^ ]*\).*/\1 \2/' /proc/loadavg`

echo -e "$TIME\t$LA\t$SGE_EXECD"

  • Monitoring the BDII on the CE: monitor_BDIIsge.sh

#!/bin/bash


VO="${1:-dteam}"
CE="${2:-sa3-ce.egee.cesga.es}"


DATE=`date -Iminutes`
OUTCE="IDs/$DATE"_".CE.out
OUTBDII="IDs/$DATE"_".BDII.out


GIP_QUERY="/opt/lcg/libexec/lcg-info-dynamic-sge 2>/dev/null"

BDII_PORT=2170
BDII_BASE="mds-vo-name=cesga-sa3,o=grid"
BDII_QUERY="(&(GlueForeignKey=GlueClusterUniqueID=$CE)(GlueCEName=$VO))"
BDII_ATTRS="GlueCEStateRunningJobs GlueCEStateWaitingJobs GlueCEStateTotalJobs"
BDII_QUERY="ldapsearch -h $CE -p $BDII_PORT -x -b $BDII_BASE $BDII_QUERY $BDII_ATTRS"

PRINTJOBS='/RunningJobs/{r=$2}/TotalJobs/{t=$2}/WaitingJobs/{w=$2}END{print r" "t" "w}'

TIME=`date -Iminutes`
LA=`sed -e 's/^\([^ ]*\).*\/\([^ ]*\).*/\1 \2/' /proc/loadavg`


GIP=`/opt/lcg/libexec/lcg-info-dynamic-sge 2>/dev/null|sed -ne "/$VO/,/^$/p"| awk "$PRINTJOBS"`


LDAP=`$BDII_QUERY |awk "$PRINTJOBS"`
[ $? -eq 0 ] || LDAP="0 0 0"

echo -e "$TIME\t$LA\t$GIP\t$LDAP"


Gnuplot Scripts

  • wms-lrms.sh
# wms-lrms.sh outwms outCE_wms graph.png
WMS_INPUT=$1
LRMS_INPUT=$2
OUTPUT=$3
TICS=$4

( 
cat<<EOF
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set y2tics border nomirror norotate
set style data lines

set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"

set ylabel "Jobs"

set title "WMS vs LRMS status report"

set terminal png size 600,300
EOF

[ -n "$OUTPUT" ] && echo "set output '$OUTPUT'"
echo "plot '$WMS_INPUT' using 1:9 title 'WMS Ready' with lines lw 3, \
           '$WMS_INPUT' using 1:3 title 'WMS Done' with lines lw 3, \
           '$WMS_INPUT' using 1:7 title 'WMS Running' with lines lw 3, \
           '$WMS_INPUT' using 1:5 title 'WMS Scheduled' with lines lw 3, \
           '$LRMS_INPUT' using 1:12 title 'LRMS Done' with lines lw 3"

)|gnuplot
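
The optional fourth argument is passed straight to gnuplot's "set xtics", so a numeric value sets the tic spacing in seconds on the time axis (the other plotting scripts accept an analogous trailing argument). A sample invocation, with the file names taken from the usage comment above and therefore only illustrative:

./wms-lrms.sh outwms outCE_wms graph.png 3600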

  • gip-bdii.sh
# gip-bdii.sh outBDII graph.png
INPUT=$1
OUTPUT=$2
TICS=$3
( 
cat<<EOF
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set style data lines

set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"

set ylabel "Jobs"

set title "LRMS Queue Length"

set terminal png size 500,300
EOF

echo "set output '$OUTPUT'"
echo "plot '$INPUT' using 1:6 title 'Waiting (GIP)' with lines lw 3, \
     '$INPUT' using 1:9 title 'Waiting (BDII)' with lines lw 3"
)|gnuplot

  • loads.sh
# loads.sh outCE outWN outCE graph.png
 

INPUT1=$1
INPUT2=$2
INPUT3=$3
OUTPUT=$4
TICS=$5
( 
cat<<EOF
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate 
set style data lines

set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"

set ylabel "1 min load average"

set title "Average Loads"

set terminal png size 500,300
EOF

echo "set output '$OUTPUT'"
echo "plot '$INPUT1' using 1:2 title 'lcgCE' with lines lw 3, \
     '$INPUT2' using 1:2 title 'WN_SGE' with lines lw 3, \
     '$INPUT3' using 1:8 title 'SGE_server' with lines lw 3"
)|gnuplot

  • sgeqmaster-mem.sh
# sgeqmaster-mem.sh outCE graph.png
INPUT=$1
OUTPUT=$2
TICS=$3
( 
cat<<EOF
set mouse
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set y2tics border nomirror norotate
set style data lines

set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"

set ylabel "Size (x1000k)"
set y2label "Queued Jobs"

set title "SGE Qmaster Process Size"

set terminal png size 600,300
EOF


echo "set output '$OUTPUT'"
echo "plot '$INPUT' using 1:11 axes x1y2 title 'Jobs' with lines lw 3, \
           '$INPUT' using 1:9 title 'SGE Qmaster Footprint' with lines lw 3"
            
)|gnuplot

  • sgeschedd-mem.sh
# sgeschedd-mem.sh outCE graph.png
INPUT=$1
OUTPUT=$2
TICS=$3
( 
cat<<EOF
set mouse
set grid
set key left nobox
set xtics border nomirror norotate $TICS
set ytics border nomirror norotate
set y2tics border nomirror norotate
set style data lines

set xdata time
set timefmt "%Y-%m-%dT%H:%M:%S+0200"
set format x "%H:%M"
set xlabel "Time"

set ylabel "Size (x1000k)"
set y2label "Queued Jobs"

set title "SGE Schedd Process Size"

set terminal png size 600,300
EOF


echo "set output '$OUTPUT'"
echo "plot '$INPUT' using 1:11 axes x1y2 title 'Jobs' with lines lw 3, \
           '$INPUT' using 1:7 title 'SGE Schedd Footprint' with lines lw 3"
            
)|gnuplot

Job Submitting

  • Script job_sub.sh. This script submits NUM jobs through a WMS, creating an IDs directory to store the logs.

#!/bin/bash

DATE=`date -Iminutes`

SITE=${1:-cern}
NUM=${2:-20}
JOB=${3:-sa3-200.jdl}
WMS=$4

IDS="IDs/$DATE"_"$SITE"
OUT="IDs/$DATE"_"$SITE"_"$NUM".out
STATUS="IDs/$DATE"_"$SITE"_"$NUM".status

mkdir -p IDs   # directory used to store the job IDs and logs

if [ $SITE == "uoa" ];then
        WMS="https://ctb06.gridctb.uoa.gr:7443/glite_wms_wmproxy_server"
elif [ $SITE == "cern" ];then
        #WMS="https://lxb2032.cern.ch:7443/glite_wms_wmproxy_server"
        WMS="https://wms01.egee.cesga.es:7443/glite_wms_wmproxy_server"
elif [ $SITE == "cy" ];then
        WMS="https://wmslb201.grid.ucy.ac.cy:7443/glite_wms_wmproxy_server"
fi

DELAY=1

echo "$DATE"> $OUT
echo "$DATE"> $STATUS
time for i in `seq 1 $NUM`; do 
        glite-wms-job-submit -a -o $IDS -c /opt/cesga/lcg-sa3/glite/etc/dteam/glite_wms.conf -e $WMS $JOB
        sleep $DELAY
done>>$OUT 2>&1 &

while true
do
        SEC=`date +%S`
        if [ $SEC -ge "59" -o $SEC -le "1" ];then
                JOBS="`glite-wms-job-status -c /opt/cesga/lcg-sa3/glite/etc/dteam/glite_wms.conf --noint --logfile /dev/null -i $IDS |awk '
                BEGIN {a["Done"]     =a["Ready"]    =a["Running"]=a["Scheduled"]=0;
                       a["Undefined"]=a["Submitted"]=a["Waiting"]=a["Cleared"]  =0;
                       a["Aborted"]  =a["Cancelled"]=a["Purged"] =a["Unknown"]  =0;}
                /Current Status:[ ]*([^ ]*)/ {a[$3]+=1;total+=1}
                END {if (total==0) exit 1;
                     for (i in a){ 
                          printf("%s=%s\t",i,a[i]);}
                     printf("\n");}'`"
                if [ $? -eq 0 ];then
                        CURRENT_DATE=`date -Iminutes`
                        echo -e "$CURRENT_DATE\t$JOBS" >> $STATUS
                        WAIT_JOBS=`echo "$JOBS" |egrep -qi "(SUBMITTED=[^0]|WAITING=[^0]|READY=[^0]|SCHEDULED=[^0]|RUNNING=[^0])";echo $?`
                        if [ "$WAIT_JOBS" -ne 0 ];then
                                break;
                        fi
OUT="IDs/$DATE"_"$SITE"_"$NUM".out
STATUS="IDs/$DATE"_"$SITE"_"$NUM".status

if [ $SITE == "uoa" ];then
        WMS="https://ctb06.gridctb.uoa.gr:7443/glite_wms_wmproxy_server"
elif [ $SITE == "cern" ];then
        #WMS="https://lxb2032.cern.ch:7443/glite_wms_wmproxy_server"
        WMS="https://wms01.egee.cesga.es:7443/glite_wms_wmproxy_server"
elif [ $SITE == "cy" ];then
        WMS="https://wmslb201.grid.ucy.ac.cy:7443/glite_wms_wmproxy_server"
fi

DELAY=1

echo "$DATE"> $OUT
echo "$DATE"> $STATUS
time for i in `seq 1 $NUM`; do 
        glite-wms-job-submit -a -o $IDS -c /opt/cesga/lcg-sa3/glite/etc/dteam/glite_wms.conf -e $WMS $JOB
        sleep $DELAY
done>>$OUT 2>&1 &

while true
do
        SEC=`date +%S`
        if [ $SEC -ge "59" -o $SEC -le "1" ];then
                JOBS="`glite-wms-job-status -c /opt/cesga/lcg-sa3/glite/etc/dteam/glite_wms.conf --noint --logfile /dev/null -i $IDS |awk '
                BEGIN {a["Done"]     =a["Ready"]    =a["Running"]=a["Scheduled"]=0;
                       a["Undefined"]=a["Submitted"]=a["Waiting"]=a["Cleared"]  =0;
                       a["Aborted"]  =a["Cancelled"]=a["Purged"] =a["Unknown"]  =0;}
                /Current Status:[ ]*([^ ]*)/ {a[$3]+=1;total+=1}
                END {if (total==0) exit 1;
                     for (i in a){ 
                          printf("%s=%s\t",i,a[i]);}
                     printf("\n");}'`"
                if [ $? -eq 0 ];then
                        CURRENT_DATE=`date -Iminutes`
                        echo -e "$CURRENT_DATE\t$JOBS" >> $STATUS
                        WAIT_JOBS=`echo "$JOBS" |egrep -qi "(SUBMITTED=[^0]|WAITING=[^0]|READY=[^0]|SCHEDULED=[^0]|RUNNING=[^0])";echo $?`
                        if [ "$WAIT_JOBS" -ne 0 ];then
                                break;
                        fi

                        sleep 10
                fi
        fi
        sleep 2
done
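
As a usage example, the 200-job WMS submission discussed in the Results section below roughly corresponds to the following call (arguments are illustrative; the defaults are cern, 20 jobs and sa3-200.jdl):

./job_sub.sh cern 200 sa3-200.jdl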

  • Script qsub_test.sh. This script submits $1 jobs directly to SGE using qsub as a local user.
#!/bin/sh

# --> All jobs run with high priority so that they execute before others,
# since they are simple but important. High priority is represented by a high value!

USER="esfreire"
GROUP="cesga"
QUEUE="cesga"
HOWMANY="$1"
TOTAL=`echo $HOWMANY - 1 | bc` # If we run the script with 200, qsub only executes 199 (off-by-one)
JOB='cd $HOME;date;hostname;whoami;pwd';

if [ -z $HOWMANY ];then
        exit 2
fi

trap "exit 2" 1 2 3 15
# Create a unique temporary directory, else exit with error
TMPDIR=`mktemp -d $HOME/qsub_test-XXXXXXXX` || {
     echo Could not create temporary directory in /home/esfreire/ ; exit 1; }
# Clean up the TMPDIR always before leaving the script
trap "rm -rf $TMPDIR" 0



JOBIDS="$TMPDIR/jobids"
rm -f $JOBIDS

echo "START TIME: `date -Iminutes`"

# Then check for successful execution of HOWMANY jobs, meaning all of them
echo Submitting $HOWMANY simple jobs...
for JOBNO in `seq 1 $HOWMANY`
do
        JOBDIR="$TMPDIR/$JOBNO"
        mkdir $JOBDIR
        cd $JOBDIR 
        echo $JOB | qsub -p 1024 >> $JOBIDS
done

# Wait for all jobs to complete, or be removed somehow from the queue.
# WARNING: this could wait for a very long time, so this script should be supervised,
# either by the user or by another script
echo Waiting for all jobs to end...
COMPLETED=`qstat -s z -u $USER | awk 'BEGIN { cont=0 }{ cont++; if ( cont > 3 ) print $1}' | wc -l`
while [ $COMPLETED != $TOTAL ]
do
        sleep 30
        COMPLETED=`qstat -s z -u $USER | awk 'BEGIN { cont=0 }{ cont++; if ( cont > 3 ) print $1}' | wc -l`
        echo "$COMPLETED jobs completed..."
done
echo "DONE!"
# Now that we know the jobs are done, check stdout and stderr of all jobs
echo Checking successful execution...
echo A simple dot means job success, an exclamation mark is failure!
FAIL=no
for JOBNO in `seq 1 $HOWMANY`
do
        JOBOUT="$TMPDIR/$JOBNO/STDIN.o*"
        JOBERR="$TMPDIR/$JOBNO/STDIN.e*"
        if [ `cat $JOBERR 2>/dev/null | wc -l` != 0 -o `cat $JOBOUT 2>/dev/null | wc -l` == 0 ]
        then
                echo -n !
                FAIL=yes
        else
                echo -n .
        fi
done
echo

echo "END TIME: `date -Iminutes`"

[ $FAIL == yes ] && exit 1
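
As a usage example, the 200-job direct submission discussed in the Results section below corresponds to (the only argument is the number of jobs):

./qsub_test.sh 200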

  • Script qsub1.sh. This is a stress script that submits a large number of jobs and then deletes them immediately.

#!/bin/bash

NUM_OF_ITERATIONS=${1:-2};
NUM_OF_JOBS=${2:-10};
SLEEP_TIME=${3:-604800};
USER="cesga025";
QUEUE="cesga"
FILE_OUT=`mktemp $HOME/qsub_stress_out_XXXXXXXX`
FILE_ERR=`mktemp $HOME/qsub_stress_err_XXXXXXXX`


exec 1<>$FILE_OUT
exec 2<>$FILE_ERR

submit_job () {
   BS_JOB_ID="";
   #BS_JOB_ID=`su - $USER -c "echo sleep $SLEEP_TIME | qsub -q $QUEUE -o $TMPDIR/tmpout -e $TMPDIR/tmperr"`
   BS_JOB_ID=`echo sleep $SLEEP_TIME | qsub -q $QUEUE -o $HOME/stdout -e $HOME/stderr | cut -d" " -f3`
}

jobs_iteration () {
        for JOB_NUM in `seq 1 $NUM_OF_JOBS`;do
                echo -e "0\t$INTER_NUM\t$JOB_NUM"
                submit_job && JOBS_ARRAY[JOB_NUM]=$BS_JOB_ID;
        done

        for JOB_NUM in `seq 1 $NUM_OF_JOBS`;do
                echo -e "1\t$INTER_NUM\t$JOB_NUM\t${JOBS_ARRAY[JOB_NUM]}"
                qdel ${JOBS_ARRAY[JOB_NUM]}
        done
}

for ITER_NUM in `seq 1 $NUM_OF_ITERATIONS`;do
        jobs_iteration &
        # wait an hour, then resubmit with 1000 more jobs
        sleep 3600
        NUM_OF_JOBS=`echo $NUM_OF_JOBS + 1000 | bc`
done

wait

mv $FILE_OUT qsub_stress_`date -Iminutes`_"$NUM_OF_ITERATIONS"_"$NUM_OF_JOBS".out
mv $FILE_ERR qsub_stress_`date -Iminutes`_"$NUM_OF_ITERATIONS"_"$NUM_OF_JOBS".err

exit 0;
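
As a usage example, the stress run described in the Results section (1000, 2000 ... 7000 jobs, adding 1000 jobs per hourly iteration) roughly corresponds to the following call (arguments: number of iterations, initial number of jobs, job sleep time in seconds):

./qsub1.sh 7 1000 604800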

Performed Tests

BDII

  • Check Batch System information published through BDII

>ldapsearch -h sa3-ce.egee.cesga.es -p 2170 -x -b mds-vo-name=cesga-sa3,o=grid "(&(GlueForeignKey=GlueClusterUniqueID=sa3-ce.egee.cesga.es)(GlueCEName=dteam))" GlueCEInfoHostName GlueCEInfoLRMSType GlueCEInfoLRMSVersion GlueCEInfoJobManager

  • Results

#
# filter: (&(GlueForeignKey=GlueClusterUniqueID=sa3-ce.egee.cesga.es)(GlueCEName=dteam))
# requesting: GlueCEInfoHostName GlueCEInfoLRMSType GlueCEInfoLRMSVersion GlueCEInfoJobManager 
#

# sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam, CESGA-SA3, grid
dn: GlueCEUniqueID=sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam,mds-vo-na
 me=CESGA-SA3,o=grid
GlueCEInfoHostName: sa3-ce.egee.cesga.es
GlueCEInfoLRMSType: sge
GlueCEInfoLRMSVersion: 6.0u7
GlueCEInfoJobManager: lcgsge

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

QUEUES

  • Check Batch System configuration
>qconf -sq '*'

  • Results
qname                 biomed
hostlist              sa3-ce.egee.cesga.es sa3-wn001.egee.cesga.es
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              19
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            biomed
xuser_lists           NONE
subordinate_list      NONE

...

  • Nodes list:
>qhost

  • Results:
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
sa3-ce                  -               -     -       -       -       -       -
sa3-wn001               lx26-x86        1  0.00  248.2M   19.1M  509.8M    3.9M

>ls /usr/local/sge/V60u7_1/default/spool/qmaster/exec_hosts/
global  sa3-ce.egee.cesga.es  sa3-wn001.egee.cesga.es  template

  • View the global cluster configuration
>qconf -sconf

  • Results:
global:
execd_spool_dir              /opt/cesga/sge60/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             unix_behavior
login_shells                 sh,ksh,csh,tcsh,bash
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
....

  • View the WN (execution host) configuration

> qconf -se sa3-wn001.egee.cesga.es

  • Results:
hostname              sa3-wn001.egee.cesga.es
load_scaling          NONE
complex_values        NONE
load_values           arch=lx26-x86,num_proc=1,mem_total=248.183594M, \
                      swap_total=509.835938M,virtual_total=758.019531M, \
                      load_avg=0.000000,load_short=0.000000, \
                      load_medium=0.000000,load_long=0.000000, \
                      mem_free=229.195312M,swap_free=505.984375M, \
                      virtual_free=735.179688M,mem_used=18.988281M, \
                      swap_used=3.851562M,virtual_used=22.839844M, \
                      cpu=0.000000,np_load_avg=0.000000, \
                      np_load_short=0.000000,np_load_medium=0.000000, \
                      np_load_long=0.000000
processors            1
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE


NETWORK

  • Check network ports

>netstat -apt

  • Results on WN:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
tcp        0      0 *:sunrpc                    *:*                         LISTEN      687/portmap         
tcp        0      0 *:32786                     *:*                         LISTEN      706/rpc.statd       
tcp        0      0 *:8659                      *:*                         LISTEN      5175/gmond          
tcp        0      0 xen.egee.cesga.es:32788     *:*                         LISTEN      876/xinetd          
tcp        0      0 *:ssh                       *:*                         LISTEN      862/sshd            
tcp        0      0 xen.egee.cesga.es:smtp      *:*                         LISTEN      918/sendmail: accep 
tcp        0      0 *:sge_execd                 *:*                         LISTEN      854/sge_execd       
tcp        0    144 sa3-wn001.egee.cesga.es:ssh paco.murcia.cesga.es:37513  ESTABLISHED 9006/0              
tcp        0      0 sa3-wn001.egee.cesga.:35930 sa3-ce.egee.ces:sge_qmaster ESTABLISHED 854/sge_execd   

  • Results on CE:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name   
tcp        0      0 *:globus-gatekeeper         *:*                         LISTEN      7963/edg-gatekeeper 
tcp        0      0 *:9002                      *:*                         LISTEN      9861/edg-wl-logd    
tcp        0      0 *:mysql                     *:*                         LISTEN      3917/mysqld         
tcp        0      0 *:8659                      *:*                         LISTEN      4182/gmond          
tcp        0      0 sa3-ce.egee.cesga.es:2135   *:*                         LISTEN      6325/slapd          
tcp        0      0 *:sge_qmaster               *:*                         LISTEN      10030/sge_qmaster   
tcp        0      0 localhost.localdomain:smtp  *:*                         LISTEN      792/sendmail: accep 
tcp        0      0 *:2170                      *:*                         LISTEN      28795/bdii-fwd [acc 
tcp        0      0 localhost.localdomain:2171  *:*                         LISTEN      10154/slapd         
tcp        0      0 *:2811                      *:*                         LISTEN      9542/ftpd: acceptin 
tcp        0      0 localhost.localdomain:2172  *:*                         LISTEN      10687/slapd         
tcp        0      0 localhost.localdomain:2173  *:*                         LISTEN      9666/slapd          
tcp        0      0 sa3-ce.egee.ces:sge_qmaster sa3-wn001.egee.cesga.:35930 ESTABLISHED 10030/sge_qmaster   
tcp        0      0 sa3-ce.egee.cesga.es:58991  sa3-mon.egee.cesga.es:12409 ESTABLISHED 28644/edg-fmon-agen 
tcp        0      0 sa3-ce.egee.cesga.es:2135   sa3-ce.egee.cesga.es:40588  TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.cesga.es:56406  sa3-ce.egee.ces:sge_qmaster TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.cesga.es:56405  sa3-ce.egee.ces:sge_qmaster TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.cesga.es:8659   www2.egee.cesga.es:47588    TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.ces:sge_qmaster sa3-ce.egee.cesga.es:51785  ESTABLISHED 10030/sge_qmaster   
tcp        0      0 localhost.localdomain:2171  localhost.localdomain:60003 TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.cesga.es:8659   www2.egee.cesga.es:47596    TIME_WAIT   -                   
tcp        0      0 localhost.localdomain:2172  localhost.localdomain:51367 TIME_WAIT   -                   
tcp        0      0 localhost.localdomain:60003 localhost.localdomain:2171  TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.cesga.es:51785  sa3-ce.egee.ces:sge_qmaster ESTABLISHED 10063/sge_schedd    
tcp        0      0 sa3-ce.egee.cesga.es:8659   www2.egee.cesga.es:47603    TIME_WAIT   -                   
tcp        0      0 localhost.localdomain:51366 localhost.localdomain:2172  TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.cesga.es:8659   www2.egee.cesga.es:47572    TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.cesga.es:45480  sa3-mon.egee.cesga.es:2135  TIME_WAIT   -                   
tcp        0      0 sa3-ce.egee.cesga.es:8659   www2.egee.cesga.es:47580    TIME_WAIT   -                   
tcp        0      0 *:ssh                       *:*                         LISTEN      29087/sshd          
tcp        0   2304 sa3-ce.egee.cesga.es:ssh    paco.murcia.cesga.es:57729  ESTABLISHED 6591/0              

  • View sge qmaster logs.

>cat /usr/local/sge/V60u7_1/default/spool/qmaster/messages

  • Results:
03/29/2007 18:05:33|qmaster|sa3-ce|I|read job database with 0 entries in 0 seconds
03/29/2007 18:05:33|qmaster|sa3-ce|I|qmaster hard descriptor limit is set to 1024
03/29/2007 18:05:33|qmaster|sa3-ce|I|qmaster soft descriptor limit is set to 1024
03/29/2007 18:05:33|qmaster|sa3-ce|I|qmaster will use max. 1004 file descriptors for communication
03/29/2007 18:05:33|qmaster|sa3-ce|I|qmaster will accept max. 99 dynamic event clients
03/29/2007 18:05:33|qmaster|sa3-ce|I|starting up 6.0u7
...

JOBS

>vi sa3.jdl

JobType         = "Normal";
ShallowRetryCount = 0;
RetryCount      = 0;
Executable      = "sa3.sh";
Arguments       = "180 1";
StdOutput       = "sa3.out";
StdError        = "sa3.err";
InputSandbox    = {"sa3.sh"};
OutputSandbox   = {"sa3.out","sa3.err"};
#Rank           = other.GlueCEStateFreeCPUs;
Requirements    = other.GlueCEUniqueID == "sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam";

>vi sa3.sh

#!/bin/bash

SLEEP_TIME=${1:-0};
DO_MD5=${2:-0};
FILE_MD5=${3:-/dev/urandom};

NUM_OF_BG_JOBS=2

date
hostname
whoami
pwd


if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then
        for i in `seq 1 $NUM_OF_BG_JOBS`;do
                md5sum $FILE_MD5 &
                job_id=$!
                jobs_array[i]=$job_id
        done
fi

if [ "$SLEEP_TIME" != "0" ];then
        sleep $SLEEP_TIME
fi

if [ "$DO_MD5" != "0" -a -f $FILE_MD5 ];then
        for i in `seq 1 $NUM_OF_BG_JOBS`;do
                kill ${jobs_array[i]}
        done
fi
exit 0;

  • Submit 10 jobs:

>for i in `seq 1 10`; do edg-job-submit --vo dteam -o test.jid sa3.jdl;done

[root@sa3-ce libexec]# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    631 0.00000 STDIN      dteam001     qw    04/30/2007 18:22:29                                    1        
    632 0.00000 STDIN      dteam001     qw    04/30/2007 18:22:33                                    1        

  • Scheduler Info Provider output on the site BDII:
# sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam, CESGA-SA3, grid 
dn: GlueCEUniqueID=sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam,mds-vo-na 
 me=CESGA-SA3,o=grid 
objectClass: GlueCETop 
objectClass: GlueCE 
objectClass: GlueSchemaVersion 
objectClass: GlueCEAccessControlBase 
objectClass: GlueCEInfo 
objectClass: GlueCEPolicy 
objectClass: GlueCEState 
objectClass: GlueInformationService 
objectClass: GlueKey 
GlueCEHostingCluster: sa3-ce.egee.cesga.es 
GlueCEName: dteam 
GlueCEUniqueID: sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam 
GlueCEInfoGatekeeperPort: 2119 
GlueCEInfoHostName: sa3-ce.egee.cesga.es 
GlueCEInfoLRMSType: sge 
GlueCEInfoLRMSVersion: 6.0u7 
GlueCEInfoTotalCPUs: 2 
GlueCEInfoJobManager: lcgsge 
GlueCEInfoContactString: sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam 
GlueCEInfoApplicationDir: /opt/exp_soft 
GlueCEInfoDataDir: unset 
GlueCEInfoDefaultSE: sa3-se.egee.cesga.es 
GlueCEStateEstimatedResponseTime: 2159906 
GlueCEStateFreeCPUs: 0 
GlueCEStateRunningJobs: 1 
GlueCEStateStatus: Production 
GlueCEStateTotalJobs: 8 
GlueCEStateWaitingJobs: 7 
GlueCEStateWorstResponseTime: 4319813 
GlueCEStateFreeJobSlots: 0 
GlueCEPolicyMaxCPUTime: 4320 
GlueCEPolicyMaxRunningJobs: 2 
GlueCEPolicyMaxTotalJobs: 0 
GlueCEPolicyMaxWallClockTime: 9000 
GlueCEPolicyPriority: 1 
GlueCEPolicyAssignedJobSlots: 0 
GlueCEAccessControlBaseRule: VO:dteam 
GlueForeignKey: GlueClusterUniqueID=sa3-ce.egee.cesga.es 
GlueInformationServiceURL: ldap://sa3-ce.egee.cesga.es:2135/mds-vo-name=local, 
 o=grid 
GlueSchemaVersionMajor: 1 
GlueSchemaVersionMinor: 2 

  • Scheduler Info Provider output with heavy LRMS load:
# sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam, CESGA-SA3, grid 
dn: GlueCEUniqueID=sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam,mds-vo-na 
 me=CESGA-SA3,o=grid 
objectClass: GlueCETop 
objectClass: GlueCE 
objectClass: GlueSchemaVersion 
objectClass: GlueCEAccessControlBase 
objectClass: GlueCEInfo 
objectClass: GlueCEPolicy 
objectClass: GlueCEState 
objectClass: GlueInformationService 
objectClass: GlueKey 
GlueCEHostingCluster: sa3-ce.egee.cesga.es 
GlueCEName: dteam 
GlueCEUniqueID: sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam 
GlueCEInfoGatekeeperPort: 2119 
GlueCEInfoHostName: sa3-ce.egee.cesga.es 
GlueCEInfoLRMSType: sge 
GlueCEInfoLRMSVersion: 6.0u7 
GlueCEInfoTotalCPUs: 2 
GlueCEInfoJobManager: lcgsge 
GlueCEInfoContactString: sa3-ce.egee.cesga.es:2119/jobmanager-lcgsge-dteam 
GlueCEInfoApplicationDir: /opt/exp_soft 
GlueCEInfoDataDir: unset 
GlueCEInfoDefaultSE: sa3-se.egee.cesga.es 
GlueCEStateEstimatedResponseTime: 12689927 
GlueCEStateFreeCPUs: 0 
GlueCEStateRunningJobs: 1 
GlueCEStateStatus: Production 
GlueCEStateTotalJobs: 47 
GlueCEStateWaitingJobs: 46 
GlueCEStateWorstResponseTime: 25379855 
GlueCEStateFreeJobSlots: 0 
GlueCEPolicyMaxCPUTime: 4320 
GlueCEPolicyMaxRunningJobs: 2 
GlueCEPolicyMaxTotalJobs: 0 
GlueCEPolicyMaxWallClockTime: 9000 
GlueCEPolicyPriority: 1 
GlueCEPolicyAssignedJobSlots: 0 
GlueCEAccessControlBaseRule: VO:dteam 
GlueForeignKey: GlueClusterUniqueID=sa3-ce.egee.cesga.es 
GlueInformationServiceURL: ldap://sa3-ce.egee.cesga.es:2135/mds-vo-name=local, 
 o=grid 
GlueSchemaVersionMajor: 1 
GlueSchemaVersionMinor: 2 

ACCOUNTING

>qacct -b 0705011000 -o dteam001 
OWNER        WALLCLOCK         UTIME         STIME           CPU             MEMORY                 IO                IOW 
========================================================================================================================= 
dteam001           881            27            24           738              2.527              0.000              0.000 

  • Number of jobs run by the VO:
[esfreire@sa3-ce esfreire]$ qacct -j -g dteam | grep jobnumber | wc -l
  4488

BATCH SYSTEM

Description: About four simple jobs, each with a sleep of 180 s, are submitted directly to the LRMS. While the jobs are running, the SGE daemon on the Worker Node is switched to the down state using the SGE command-line interface and afterwards switched back online. In addition, the resilience of the LRMS is checked by shutting down sge_qmaster and sge_schedd and starting them again after a while.

Comments: If we stop the qmaster and the scheduler on the CE (/etc/init.d/sgemaster stop), the jobs no longer appear as running, but when we restart the services we see the jobs running again and they finish without problems. However, if we stop sge_execd on the WN, the running job is killed when the daemon is restarted.

In SGE, one way to set a WN offline would be to configure num_proc as a complex on the WN and set num_proc=0 when we do not want jobs to be scheduled on that node (a sketch of this approach is shown below). We have not tested this in the current testbed, but we use it in other CESGA clusters.
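
A minimal sketch of this approach, assuming standard SGE 6.x administration commands (not tested on this testbed):

# Override num_proc on the execution host so the scheduler stops sending
# jobs there (num_proc must be requestable in the complex list, see qconf -mc):
qconf -me sa3-wn001.egee.cesga.es
#   complex_values   num_proc=0
# Reverting complex_values (e.g. back to NONE) brings the node online again.
# The usual alternative is simply disabling/enabling the queue instance on that host:
qmod -d 'cesga@sa3-wn001.egee.cesga.es'
qmod -e 'cesga@sa3-wn001.egee.cesga.es'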


Results

Jobs Status

  • Sending 200 jobs through the WMS and through an RB.
We can observe that WMS job submission is faster than the RB, and that the WMS reports jobs as Done slightly later (only a few minutes) than the LRMS.

wms01.png RB.png

  • Sending 200 jobs using qsub.
In the LRMS queue length plot we can see that the SGE GIP response is fast, with at worst a one-minute delay with respect to the BDII. The average load on the SGE WN and on the SGE server is very low.

bdiiqsub200.png loadqsub200.png

The sge_schedd memory footprint decreases very quickly, whereas sge_qmaster behaves differently, showing some peaks during job execution.

scheddmemqsub200.png qmastermemqsub200.png

  • Sending 1000 jobs using qsub.
The GIP response remains constant, as in the previous case. The SGE WN is more loaded than the CE (the CE load average is the same as with 200 jobs).

bdiiqsub1000.png loadqsub1000.png

With 1000 jobs the sge_schedd memory was freed within 2 hours, but the sge_qmaster footprint remained constant.

scheddmemqsub1000.png qmastermemqsub1000.png

Stress Tests

  • Stressing the LRMS memory management.
Submitting 1000, 2000 ... 7000 jobs and deleting them at different time intervals. With a large number of jobs sge_schedd shows some peaks in used memory, but usage decreases within about an hour and stays constant over 12 hours. After several hours we submitted 7000 jobs and observed another memory peak, followed by a fast reduction in memory usage (within about 2 hours).

scheddmemqsubstress.png

sge_qmaster uses more memory than the scheduler; the maximum usage was a peak of 500 MB. After the peaks, memory usage remained constant with slow reductions.

qmastermemqsubstress.png

  • Behavior of LRMS memory management after the stress tests over a long time period.
After 24 hours under normal load the memory usage returns to its initial state. scheddmemlong.png

qmastermemlong.png


Notes

About SGE queue

Tip: To run the tests we had to change the default SGE configuration. Specifically, we changed the scheduler (qconf -msconf) and global (qconf -mconf) configuration so that SGE accepts 7000 jobs and qstat -s z also displays 7000 jobs.

  • For successive tests the same operation applies, e.g. adjust the limits to submit 2000, 4000, ... jobs.

qconf -msconf

...
maxujobs                          7000
max_functional_jobs_to_schedule   7000
...

qconf -mconf

....
finished_jobs                7000
max_u_jobs                   7000

About SGE job epilog

  • Configuring the epilog script to transfer files to submit host

In SGE, when a job starts executing it generates two files:

     •   stdout.o(JobID)

The standard output of the job is redirected to this file.

     •   stderr.e(JobID)

Any errors produced by the job are redirected to this file.

On the other hand, in the queue configuration and in the "Global Cluster Configuration" there are two parameters that can be configured:


     •   Prolog

A prolog script can be configured to be executed before the job starts.

     •   Epilog

An epilog script can be configured to be executed when the job finishes.

A typical LCG/gLite installation does not need shared home directories. However, such a setup is unusual at most SGE sites, as it makes supporting MPI harder. We consider the case where the $HOME directory is not NFS mounted or shared via any other network file system. One major technical problem in this case is how to handle job input, output and errors: in the NFS-based case job-related files are transparently available on both the submission node and the worker node, whereas in the non-NFS case any files created by the job on the Worker Node need to be copied back to the submission node.

The most important files we deal with are stdout.o(JobID) and stderr.e(JobID); these files are generated when a job starts executing on the WN, so no prolog script is necessary. The file copying is implemented through a custom job epilog script, "epilog.sh". Since epilog.sh is executed on the WN, it is important that this script is copied to all Worker Nodes and that appropriate access modes are set (use chmod 755). We copied it to: /usr/local/sge/pro/default/common/epilog.sh

To configure a queue with an epilog script we should:


         a) List the configured queues using the command 'qconf -sql'
         b) Edit the queue's settings using the command 'qconf -mq queue_name'

[esfreire@sa3-ce esfreire]$ qconf -mq cesga

qname                 cesga
....
epilog                /usr/local/sge/pro/default/common/epilog.sh
....

Tip: To avoid errors generated in the epilog, it is necessary to set StrictHostKeyChecking=no in the file /etc/ssh/ssh_config on each Execution Host; otherwise the epilog fails at host key verification.

[root@sa3-wn001 esfreire]# cat /usr/local/sge/pro/default/common/epilog.sh

#!/bin/sh

# Epilog script to transfer stdout and stderr files from the execution host to the submit host. 
# This is useful when the home of the users is not shared using NFS.
# by Cesga-Sa3

#########################################################################
# Variables
user=$SGE_O_LOGNAME # Login name of the user
machine=$SGE_O_HOST # Submit host
stderr=$SGE_STDERR_PATH # stderr file
stdout=$SGE_STDOUT_PATH # stdout file
dest=$SGE_O_WORKDIR # Directory from which the qsub was submitted
dir=`pwd` # Directory on the execution host in which the job is executed
#########################################################################

# Copy stdout and stderr files back to the submit host
if [ -r $stdout ]; then
   scp $stdout $user@$machine:$dest
fi
if [ -r $stderr ]; then
   scp $stderr $user@$machine:$dest
fi


# Delete stderr and stdout files if nobody changed their write permissions
if [ -w $stdout ]; then
   rm $stdout
fi
if [ -w $stderr ]; then
   rm $stderr
fi


FINISTERRAE STRESS TESTS

Overview

As on the SA3 site, we ran stress tests on the new Finisterrae supercomputer located at CESGA to check SGE behavior on a big cluster:

  • SGE 6.1u3 Batch System
  • 142 Itanium nodes, each with:
    • 16 CPUs
    • 128 GB of memory

Stress Tests

We use basically the same scripts as in the SA3 tests; in this case we check only the SGE batch system and the system load. To do this we run the monitoring scripts on each node and then submit different job arrays from the SGE server.

  • Submitting from 1000 to 7000 jobs from the submit host using qsub job arrays.

load_qmaster.png graphcn009_outWN_2008-04-03.png

The load average for sge_qmaster on the submit host and for sge_execd on all nodes remains constant the whole time.

As in SA3, the sge_schedd memory footprint decreases very quickly, whereas sge_qmaster behaves differently, with some peaks during job execution.

sgeqmaster-mem.png sgeschedd-mem.png

  • Submitting jobs without qsub job arrays.

In this case we submit the jobs with a for loop; memory consumption is greater than in the job array test (see the sketch below).

sgeqmaster-mem_for.png sgeschedd-mem_for.png
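
For reference, the two submission patterns compared here are roughly the following (sleep.sh stands for a trivial test job and the numbers are illustrative):

# Job array: a single qsub call creates 7000 tasks
qsub -t 1-7000 sleep.sh

# Plain loop: 7000 independent qsub calls, one job object each
for i in `seq 1 7000`; do qsub sleep.sh; done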


SGE CVS

CVS: http://jra1mw.cvs.cern.ch:8180/cgi-bin/jra1mw.cgi/org.glite.testsuites.ctb/SGE/

References

Torque/Maui certification Plan and Results: http://master.gridctb.uoa.gr/torque-report/Torque-Maui_testplan.html

Sun Grid Engine Home Page: http://gridengine.sunsource.net/
