gLite CE Lsf Testing

Test Environment

  • Site used: SA3-TB-CNAF
  • Date: July 2007
  • Hosts:
    • wmstest-ce06.cr.cnaf.infn.it CE hostname
    • wmstest-ce02.cr.cnaf.infn.it LSF master node (and also WN)
    • wmstest-ce03.cr.cnaf.infn.it WN
    • wmstest-ce04.cr.cnaf.infn.it WN
    • wmstest-ce05.cr.cnaf.infn.it BDII
  • OS: Scientific Linux CERN Release 3.0.8 (SL)
  • LRMS: Platform LSF 6.1, Dec 15 2004
  • Repository: http://lxb2042.cern.ch/gLite/APT/R3.0-cert rhel30 externals Release3.0 updates updates.certified internal

Configuration

No special configuration of the batch system is needed. For these tests we adopted a standard setup:

  • we use a separate LSF master node and export the LSF installation directory via NFS to the CE and to the WNs
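As a sketch, the shared installation can be exported like this (the /data/lsf path matches this testbed; the host pattern and mount options are illustrative assumptions, not the actual site configuration):

```shell
# On the LSF master (wmstest-ce02), /etc/exports — illustrative entry:
#   /data/lsf  wmstest-ce*.cr.cnaf.infn.it(ro,sync)
# On the CE and each WN, mount the shared installation at the same path:
mount -t nfs wmstest-ce02.cr.cnaf.infn.it:/data/lsf /data/lsf
```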

Check LSF LIM configuration

  • [lsfadmin@wmstest-ce02 lsfadmin]# lsadmin ckconfig -v
Platform LSF 6.1, Dec 15 2004
Copyright 1992-2004 Platform Computing Corporation
Reading configuration from /data/lsf/conf/lsf.conf

Checking configuration files ...
Jul 10 14:00:17 2007 2066 5 6.1 /data/lsf/6.1/linux2.4-glibc2.3-x86/etc/lim -C
Jul 10 14:00:17 2007 2066 7 6.1 setMyClusterName: searching cluster files ...
Jul 10 14:00:17 2007 2066 7 6.1 setMyClusterName: local host wmstest-ce02.cr.cnaf.infn.it belongs to cluster lsf-sa3
Jul 10 14:00:17 2007 2066 3 6.1 domanager(): /data/lsf/conf/lsf.cluster.lsf-sa3(13): The cluster manager is the invoker <lsfadmin> in debug mode
Jul 10 14:00:17 2007 2066 6 6.1 reCheckClass: numhosts 4 so reset exchIntvl to 15.00
Jul 10 14:00:18 2007 2066 7 6.1 getDesktopWindow: no Desktop time window configured
Jul 10 14:00:18 2007 2066 6 6.1 Checking Done.
---------------------------------------------------------
No errors found.

Check the LSF Batch configuration files

  • [lsfadmin@wmstest-ce02 lsfadmin]# badmin ckconfig -v
Checking configuration files ...

Jul 10 13:35:15 2007 10653 3 6.1 main(): Local host is not master <wmstest-ce02.cr.cnaf.infn.it>
---------------------------------------------------------
No errors found.

Check the BLparser configuration

  • The BLAH configuration file (/opt/glite/etc/blah.config) needs manual adjustment, since the one generated by the YAIM configuration is not correct (see bug #27947).
    • Set the right value for BLAHPD_ACCOUNTING_INFO_LOG
    • Add the parameter lsf_confpath="Path where the LSF conf file is located"
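A sketch of the adjusted fragment, using paths from this testbed (the accounting log path matches the one grepped in the Log files section, and /data/lsf/conf is where lsf.conf lives on this cluster; verify both for your site):

```shell
# /opt/glite/etc/blah.config — manual fixes for bug #27947
BLAHPD_ACCOUNTING_INFO_LOG=/opt/glite/var/log/accounting.log
lsf_confpath=/data/lsf/conf
```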

Network ports and services

  • The port numbers used by the batch system are normally obtained by looking up the LSF service names in the /etc/services file or the services YP (NIS) map. If it is not possible to modify the service database, the following variables can be set (usually in the /etc/lsf.conf file) to define the port numbers:
    • LSF_LIM_PORT (default value: 6879),
    • LSF_RES_PORT (default value: 6878),
    • LSB_MBD_PORT (default value: 6881),
    • LSB_SBD_PORT (default value: 6881)
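For example, pinning the defaults explicitly (values as listed above; whether lsf.conf sits in /etc or under the shared installation, e.g. /data/lsf/conf as on this testbed, depends on the site):

```shell
# lsf.conf — explicit port assignments (default values)
LSF_LIM_PORT=6879
LSF_RES_PORT=6878
LSB_MBD_PORT=6881
LSB_SBD_PORT=6881
```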

  • The gLite CE service BLParserLSF uses the port defined by lsf_BLPport (default value: 33333). This can be set in the /opt/glite/etc/blah.config file.
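A minimal sketch of the corresponding blah.config line (default value from above):

```shell
# /opt/glite/etc/blah.config
lsf_BLPport=33333
```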

The LSF Cluster

Check cluster name and master host name

  • [lsfadmin@wmstest-ce02 lsfadmin]# lsid
Platform LSF 6.1, Dec 15 2004
Copyright 1992-2004 Platform Computing Corporation

My cluster name is lsf-sa3
My master name is wmstest-ce02.cr.cnaf.infn.it

Verify the host configuration information

  • [lsfadmin@wmstest-ce06 lsfadmin]# lshosts -w
HOST_NAME                       type       model  cpuf ncpus maxmem maxswp server RESOURCES
wmstest-ce02.cr.cnaf.infn.it    LINUX86      PC1133  23.1     2   498M  2023M    Yes ()
wmstest-ce03.cr.cnaf.infn.it    LINUX86      PC1133  23.1     2   498M  2023M    Yes ()
wmstest-ce04.cr.cnaf.infn.it    LINUX86      PC1133  23.1     2   498M  2023M    Yes ()
wmstest-ce06.cr.cnaf.infn.it    LINUX86      PC1133  23.1     2   498M  2023M    Yes ()

Test a simple submission on the listed hosts

  • [lsfadmin@wmstest-ce06 lsfadmin]$ lsgrun -v -m "wmstest-ce04.cr.cnaf.infn.it wmstest-ce02.cr.cnaf.infn.it wmstest-ce06.cr.cnaf.infn.it wmstest-ce03.cr.cnaf.infn.it" hostname
<<Executing hostname on wmstest-ce04.cr.cnaf.infn.it>>
wmstest-ce04.cr.cnaf.infn.it
<<Executing hostname on wmstest-ce02.cr.cnaf.infn.it>>
wmstest-ce02.cr.cnaf.infn.it
<<Executing hostname on wmstest-ce06.cr.cnaf.infn.it>>
wmstest-ce06.cr.cnaf.infn.it
<<Executing hostname on wmstest-ce03.cr.cnaf.infn.it>>
wmstest-ce03.cr.cnaf.infn.it

Check the current load levels

  • [lsfadmin@wmstest-ce06 lsfadmin]# lsload
HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
wmstest-ce04.cr     ok   0.0   0.0   0.0   0%   8.2   1   160   11G 2023M  381M
wmstest-ce02.cr     ok   0.1   0.0   0.0   1%  19.8   1     4   11G 2024M  210M
wmstest-ce06.cr     ok   0.1   0.5   0.3  14% 155.8   1     1   10G 2021M  173M
wmstest-ce03.cr     ok   0.5   0.0   0.1   3%  27.8   1    24   10G 2023M  374M

Verify the dynamic resource information

  • [lsfadmin@wmstest-ce06 lsfadmin]# bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
wmstest-ce02.cr.cn ok              -      5      0      0      0      0      0
wmstest-ce03.cr.cn ok              -      5      0      0      0      0      0
wmstest-ce04.cr.cn ok              -      5      0      0      0      0      0
wmstest-ce06.cr.cn closed          -      5      0      0      0      0      0

Check available queues and their configuration parameters

  • [lsfadmin@wmstest-ce06 lsfadmin]$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
short            40  Open:Active       -    -    -    -     0     0     0     0
long             40  Open:Active       -    -    -    -     0     0     0     0
infinite         40  Open:Active       -    -    -    -     0     0     0     0

Log files

  • Check that all services on the gLite CE log properly (recall that in our testbed the LSF log directory is exported through NFS, so no check is needed on the other nodes). Log files are also monitored during job submissions, and job-specific information is checked. For example, we can check the information logged for the job with LSF_ID="7896":
    • LSF event log files are in the directory $LSB_SHAREDIR/<cluster name>/logdir/
[ale@wmstest-ce06 lsf]# grep " 7896 " /data/lsf/work/lsf-sa3/logdir/lsb.events
"JOB_NEW" "6.1" 1184140021 7896 1651 36175899 1 1184140021 0 0 -65535 0 0 "dteam001" -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ""  23.10 18 "short" "" "wmstest-ce06.cr.cnaf.infn.it" "" "" "/dev/null" "/dev/null" "" "/home/dteam001" "1184140021.7896" 0 "" "" "blahjob_pO9750" "#!/bin/bash;# LSF job wrapper generated by lsf_submit.sh;# on Wed Jul 11 09:47:00 CEST 2007;#;# LSF directives:;#BSUB -L /bin/bash;#BSUB -J blahjob_pO9750;#BSUB -q short;#BSUB -f ""/home/dteam001/Condor_glidein/local.cbdf5937b0905bb7" 5 "/home/dteam001/Condor_glidein/local.cbdf5937b0905bb77acb2ab43b7de2be/spool/cluster223.proc0.subproc0/StandardOutput" "home_blahjob_pO9750/StandardOutput.1651.9740.1184140020" 2 "/home/dteam001/Condor_glidein/local.cbdf5937b0905bb77acb2ab43b7de2be/spool/cluster223.proc0.subproc0/StandardError" "home_blahjob_pO9750/StandardError.1651.9740.1184140020" 2 "/opt/glite/bin/BPRserver" "BPRserver.1651.9740.1184140020" 1 "/home/dteam001/Condor_glidein/local.cbdf5937b0905bb77acb2ab43b7de2be/spool/cluster223.proc0.subproc0/JobWrapper.https_3a_2f_2flxb2032.cern.ch_3a9000_2fi5n_5fFBCnt12qwbRBVaQQ5Q.sh" "JobWrapper.https_3a_2f_2flxb2032.cern.ch_3a9000_2fi5n_5fFBCnt12qwbRBVaQQ5Q.sh" 1 "/home/dteam001/Condor_glidein/local.cbdf5937b0905bb77acb2ab43b7de2be/spool/cluster223.proc0.subproc0/user.proxy.lmt" "blahjob_pO9750.1651.9740.1184140020.proxy" 1 "" "default" 1 "LINUX86" "/bin/bash" "" "" "" 7184 0 "" "" "" -1 -1 -1
"JOB_START" "6.1" 1184140027 7896 4 0 0 23.1 1 "wmstest-ce04.cr.cnaf.infn.it" "" "" 0 "" 0 ""
"JOB_START_ACCEPT" "6.1" 1184140027 7896 3577 3577 0
"JOB_EXECUTE" "6.1" 1184140028 7896 1651 3577 "/home/dteam001" "/home/dteam001" "dteam001" 3577 0 "" -1 "" 0
"JOB_STATUS" "6.1" 1184141843 7896 64 0 0 5.2400 1184141843 1 3.990000 1.250000 0 0 -1 0 0 14022 16235 0 0 0 -1 0 0 0 0 0 -1 0 0 0
"JOB_STATUS" "6.1" 1184141843 7896 192 0 0 5.2400 1184141843 0 0 0 0
    • BLAHPD accounting logs are in the directory pointed to by BLAHPD_ACCOUNTING_INFO_LOG
[ale@wmstest-ce06 lsf]# cat /opt/glite/var/log/accounting.log-`date +%Y%m%d` | grep '7896'
"timestamp=2007-07-11 07:47:06" "userDN=/C=IT/O=INFN/OU=Personal Certificate/L=Milano/CN=Elisabetta Molinari"  "userFQAN=/dteam/Role=NULL/Capability=NULL" "ceID=wmstest-ce06.cr.cnaf.infn.it:2119/blah-lcglsf-short" "jobID=https://lxb2032.cern.ch:9000/i5n_FBCnt12qwbRBVaQQ5Q" "lrmsID=7896" "localUser=1651"
    • Other LSF daemon log files should be found in the $LSF_LOGDIR directory (not used in these tests)

Performed Tests

BDII

  • Check Batch System information published through BDII

>ldapsearch -h wmstest-ce05.cr.cnaf.infn.it -p 2170 -x -b mds-vo-name=SA3-TB-CNAF,o=grid "(&(GlueForeignKey=GlueClusterUniqueID=wmstest-ce06.cr.cnaf.infn.it))" GlueCEInfoHostName GlueCEInfoLRMSType GlueCEInfoLRMSVersion GlueCEInfoJobManager

  • Results

#
# filter: (&(GlueForeignKey=GlueClusterUniqueID=wmstest-ce06.cr.cnaf.infn.it))
# requesting: GlueCEInfoHostName GlueCEInfoLRMSType GlueCEInfoLRMSVersion GlueCEInfoJobManager
#

# wmstest-ce06.cr.cnaf.infn.it:2119/blah-lsf-short, SA3-TB-CNAF, grid
dn: GlueCEUniqueID=wmstest-ce06.cr.cnaf.infn.it:2119/blah-lsf-short,mds-vo-nam
 e=SA3-TB-CNAF,o=grid
GlueCEInfoHostName: wmstest-ce06.cr.cnaf.infn.it
GlueCEInfoLRMSType: lsf
GlueCEInfoLRMSVersion: 6.1
GlueCEInfoJobManager: lsf

# wmstest-ce06.cr.cnaf.infn.it:2119/blah-lsf-long, SA3-TB-CNAF, grid
dn: GlueCEUniqueID=wmstest-ce06.cr.cnaf.infn.it:2119/blah-lsf-long,mds-vo-name
 =SA3-TB-CNAF,o=grid
GlueCEInfoHostName: wmstest-ce06.cr.cnaf.infn.it
GlueCEInfoLRMSType: lsf
GlueCEInfoLRMSVersion: 6.1
GlueCEInfoJobManager: lsf

# wmstest-ce06.cr.cnaf.infn.it:2119/blah-lsf-infinite, SA3-TB-CNAF, grid
dn: GlueCEUniqueID=wmstest-ce06.cr.cnaf.infn.it:2119/blah-lsf-infinite,mds-vo-
 name=SA3-TB-CNAF,o=grid
GlueCEInfoHostName: wmstest-ce06.cr.cnaf.infn.it
GlueCEInfoLRMSType: lsf
GlueCEInfoLRMSVersion: 6.1
GlueCEInfoJobManager: lsf

# search result
search: 2
result: 0 Success

# numResponses: 4
# numEntries: 3

JOB SUBMISSION

  • 2000 Job Submission using 2 WMS
    • consecutive submission of jobs of the type shown in the 'JDL Type' section below
    • comments: the charts below show that only a delay of about 5 minutes can be noted in the propagation of information from the LRMS to the CE Info Providers and the BDII. Data about 'waiting' jobs, the Estimated Response Time and the load on the CE were collected using the monitoring script described below, running on the CE.

(Charts: 0707271704ERT.png, 0707271704Wait.png)

JDL Type

[emolinar@atlfarm007 emolinar]$ cat sa3_jdl/test_lsf.jdl
[
JobType         = "Normal";
ShallowRetryCount = 0;
RetryCount      = 0;
Executable      = "job3.sh";
Arguments       = "300 1";
StdOutput       = "job3.out";
StdError        = "job3.err";
InputSandbox    = {"sa3_jdl/JOBs/job3.sh"};
OutputSandbox   = {"job3.out","job3.err"};
]
[emolinar@atlfarm007 emolinar]$ cat sa3_jdl/JOBs/job3.sh
#!/bin/bash
 
SLEEP_TIME=${1:-0};
DO_MD5=${2:-0};
FILE_MD5=${3:-/dev/urandom};
 
NUM_OF_BG_JOBS=2
 
date
hostname
whoami
pwd
 
 
if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then
        for i in `seq 1 $NUM_OF_BG_JOBS`;do
                md5sum $FILE_MD5 &
                job_id=$!
                jobs_array[i]=$job_id
        done
fi
 
if [ "$SLEEP_TIME" != "0" ];then
        sleep $SLEEP_TIME
fi
 
 
if [ "$DO_MD5" != "0" -a -r $FILE_MD5 ];then
         for i in `seq 1 $NUM_OF_BG_JOBS`;do
                 kill ${jobs_array[$i]}
         done
fi
 
exit 0;

Monitoring Scripts

The following was run on the CE:

#!/bin/bash
                                                                                                                       
# the first parameter is the interval in seconds between two runs
INTERVALSEC=$1
# the second (optional) parameter is the CE to monitor
CE="${2:-wmstest-ce06.cr.cnaf.infn.it}"
# the third (optional) parameter is the BDII to query
BDII="${3:-wmstest-ce05.cr.cnaf.infn.it}"
# the fourth (optional) parameter is the TOP BDII to query
TOPBDII="${4:-lxb2033.cern.ch}"
                                                                                                                       
OUT=`date +"%y%m%d%H%M"`"data.out"
                                                                                                                       
echo "#                Load   BatchSystem             GIP                             BDII                           TOPBDII" > $OUT
echo "# TIME            1m     run wait     r   t   w    ERT     WRT        r   t   w    ERT     WRT        r   t   w   ERT     WRT" >> $OUT
                                                                                                                       
# the name of the queue
QUEUE="long"
                                                                                                                       
# awk expression to extract info from the BDII and GIP output
PRINTJOBS='/RunningJobs/{r=$2}/TotalJobs/{t=$2}/WaitingJobs/{w=$2}/EstimatedResponseTime/{ERT=$2}/WorstResponseTime/{WRT=$2}END{printf "%2d %3d %3d %6d %7d", r, t, w, ERT, WRT}'
                                                                                                                       
# BDII settings
BDII_PORT=2170
BDII_BASE="mds-vo-name=SA3-TB-CNAF,o=grid"
TOPBDII_BASE="mds-vo-name=SA3-TB-CNAF,mds-vo-name=local,o=grid"
BDII_QUERYCE="(&(GlueForeignKey=GlueClusterUniqueID=$CE)(GlueCEUniqueID=$CE:2119/blah-lsf-$QUEUE))"
BDII_ATTRS="GlueCEStateRunningJobs GlueCEStateWaitingJobs GlueCEStateTotalJobs GlueCEStateEstimatedResponseTime GlueCEStateWorstResponseTime"
 
# we use an infinite loop, so kill the monitoring script to stop it
while true ; do
 
DATE=`date +"%Y-%m-%d_%H:%M"`
 
# query the batch system for running and waiting jobs
JOBS=`qstat $QUEUE  2>/dev/null|sed -ne "/run/p" | awk '{ printf "%2d %3d", $1, $3 }'`
 
# GIP query
GIP=`/opt/lcg/var/gip/plugin/lcg-info-dynamic-scheduler-wrapper 2>/dev/null|sed -ne "/^'dteam'/,/^$/p"| sed -ne "/$QUEUE/,/^$/p" | awk "$PRINTJOBS"`
 
# BDII query
BDII_QUERY="ldapsearch -h $BDII -p $BDII_PORT -x -b $BDII_BASE $BDII_QUERYCE $BDII_ATTRS"
LDAP=`$BDII_QUERY |awk "$PRINTJOBS"`
 
# TOPBDII query
TOPBDII_QUERY="ldapsearch -h $TOPBDII -p $BDII_PORT -x -b $TOPBDII_BASE $BDII_QUERYCE $BDII_ATTRS"
TOPLDAP=`$TOPBDII_QUERY |awk "$PRINTJOBS"`
 
# 1-minute load average
LA=`cat /proc/loadavg |awk '{ print $1}'`
 
# Print data into the output file
echo -e "$DATE $LA     $JOBS     $GIP       $LDAP       $TOPLDAP" >> $OUT
 
# wait before the next cycle
sleep $INTERVALSEC;
 
done
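The PRINTJOBS awk expression above can be exercised offline against a canned BDII-style fragment, which is handy when checking the extraction logic without a live ldapsearch (the attribute values below are made up for illustration):

```shell
# Exercise the PRINTJOBS awk expression on a hand-written fragment.
PRINTJOBS='/RunningJobs/{r=$2}/TotalJobs/{t=$2}/WaitingJobs/{w=$2}/EstimatedResponseTime/{ERT=$2}/WorstResponseTime/{WRT=$2}END{printf "%2d %3d %3d %6d %7d", r, t, w, ERT, WRT}'
printf '%s\n' \
  'GlueCEStateRunningJobs: 3' \
  'GlueCEStateTotalJobs: 10' \
  'GlueCEStateWaitingJobs: 7' \
  'GlueCEStateEstimatedResponseTime: 120' \
  'GlueCEStateWorstResponseTime: 600' \
  | awk "$PRINTJOBS"
# prints: " 3  10   7    120     600"
```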

Required Modifications: Bugs

  • Blah configuration file (/opt/glite/etc/blah.config) needs manual adjustment (bug #27947):
    • Set the right values for BLAHPD_ACCOUNTING_INFO_LOG
    • Add the parameter lsf_confpath="Path where the LSF conf file is located"
  • Modify /opt/glite/libexec/lrmsinfo-lsf according to bug #24494
  • Modify /opt/glite/yaim/functions/config_glite_ce according to bug #27979
  • Modify /opt/lcg/libexec/lcg-info-dynamic-scheduler according to bug #28035
  • Modify /opt/lcg/libexec/lcg-info-dynamic-lsf according to bug #28296

-- Main.emolina - 09 Jul 2007

Topic revision: r16 - 2007-08-09 - ElisabettMolinari
 