---+ CMS Tests with the gLite Workload Management System

%TOC%

---+++ 11 October, 2006

   * Application: CMSSW_0_6_1
   * WMS host: rb109.cern.ch
   * RAM memory: 4 GB
   * LB server: rb109.cern.ch ([[#LbNote][*]])
   * Number of submitted jobs: 25000
   * Number of jobs/collection: 100
   * Number of collections actually submitted: 234
   * Number of CEs: 24
   * Submission start time: 10/10/06, 18:45
   * Submission end time: 10/12/06, 19:10
   * Maximum number of planners/DAG: 2

---++++ Memory usage

During the submission, the swap memory usage increased linearly up to 40% and decreased rapidly shortly after the job submission stopped. This means that, at its peak, the total memory used was about 5.8 GB. The number of planners reached about 250, which accounted for about 1.4 GB. Other processes that used a lot of memory were the WMProxy server (>1.5 GB), the WM (>0.5 GB) and Condor (>0.4 GB).

Concerning WMProxy, it is not clear why it used so much memory; it was suggested to decrease the number of server threads (the current value is 30). In detail, one could:

   * reduce the maximum number of processes running simultaneously (-maxProcesses, -maxClassProcesses)
   * make the killing policy for "idle" processes more aggressive (-KillInterval; the default is 300 seconds)

This is done by changing the "FastCgiConfig" directive at the end of the file /opt/glite/etc/glite_wms_wmproxy_httpd.conf as follows:

<verbatim>
FastCgiConfig -restart -restart-delay 5 -idle-timeout 3600 -KillInterval 150 \
  -maxProcesses 25 -maxClassProcesses 10 -minProcesses 5 ..... (keep the rest as it is)
</verbatim>

Concerning the !WorkloadManager, it is a known issue that the Task Queue uses a lot of memory; this is not seen in gLite 3.1.

---++++ Performance

During the job submission, attempts to submit jobs from another UI took an unreasonable amount of time (about 3 minutes for a single job), probably because of the high level of swapping.

During the submission, the number of jobs in the Submitted status kept increasing, meaning that the WMS could not keep up with the submission rate. After the end of the submission, it took about 10 hours to dispatch all the jobs. Again, this is probably due to the general slowness of the machine caused by the swapping.

The submission rate was also very close to the maximum dispatch rate; had it been even slightly higher, jobs would have kept accumulating even without the swap memory effect. It is therefore recommended to submit at a significantly lower rate (maybe 70% of the dispatch rate?).

It is also important to have, as soon as possible, the fix that limits the time for which the WM tries to match jobs in the task queue: the current limit of 24 hours is too long, because a collection whose jobs cannot be matched is kept alive for a long time even when it is clear that the jobs can never be matched. I noticed that, on a very busy RB, jobs in a collection may be matched as late as 24 hours after submission:

<verbatim>
- JOBID: https://rb109.cern.ch:9000/nrnlNkJeABRP9u6flFPfyA

Event       Time               Reason    Exit  Src  Result  Host
RegJob      10/10/06 20:52:12                  NS           rb109.cern.ch
RegJob      10/10/06 20:52:15                  NS           rb109.cern.ch
RegJob      10/10/06 20:53:22                  NS           rb109.cern.ch
HelperCall  10/11/06 21:10:33                  BH           rb109.cern.ch
Pending     10/11/06 21:29:02  NO_MATCH        BH           rb109.cern.ch
</verbatim>

#LbNote Note: I discovered that I never really used a separate LB server: the LBAddress attribute must be in the common section of the JDL for a collection, not in the node JDL. Another possibility is to configure the RB to use a separate LB server in the RB configuration. In a recently released tag, an LB server can also be specified in the UI configuration.
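For illustration, here is a minimal sketch of a collection JDL with the LBAddress attribute in the common section. Only the position of the attribute is the point being made; the node contents (wrapper script, configuration files, sandbox lists) and the LB endpoint shown here are hypothetical:

<verbatim>
[
  Type = "collection";
  // Common section: these attributes apply to all nodes of the collection.
  // Hypothetical LB endpoint; adapt host and port to the actual setup.
  LBAddress = "lxb7026.cern.ch:9000";
  VirtualOrganisation = "cms";
  Nodes = {
    [
      Executable = "cmsRun.sh";
      Arguments = "job001.cfg";
      StdOutput = "job001.out";
      StdError = "job001.err";
      InputSandbox = {"cmsRun.sh", "job001.cfg"};
      OutputSandbox = {"job001.out", "job001.err"};
    ],
    [
      Executable = "cmsRun.sh";
      Arguments = "job002.cfg";
      StdOutput = "job002.out";
      StdError = "job002.err";
      InputSandbox = {"cmsRun.sh", "job002.cfg"};
      OutputSandbox = {"job002.out", "job002.err"};
    ]
  };
]
</verbatim>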
---+++ 13 October, 2006

   * Application: CMSSW_0_6_1
   * WMS host: rb109.cern.ch
   * RAM memory: 4 GB
   * LB server: lxb7026.cern.ch
   * Number of submitted jobs: 14000
   * Number of jobs/collection: 100
   * Number of collections actually submitted: 140
   * Number of CEs: 28
   * Submission start time: 10/13/06, 11:10
   * Submission end time: 10/14/06, 9:43
   * Maximum number of planners/DAG: 2

---+++ 30 October, 2006

   * Application: CMSSW_0_6_1
   * WMS host: lxb7283.cern.ch
   * Flavour: gLite 3.1
   * RAM memory: 4 GB
   * LB server: lxb7283.cern.ch
   * Number of submitted jobs: 2400
   * Number of jobs/collection: 100
   * Number of collections actually submitted: 24
   * Number of CEs: 24
   * Submission start time: 10/30/06, 12:30
   * Submission end time: 10/30/06, 12:59
   * Maximum number of planners/DAG: 10

---++++ Summary table

<verbatim>
Site                          Submit    Wait   Ready   Sched     Run Done(S) Done(F)     Abo   Clear    Canc
cclcgceli02.in2p3.fr              23       0       0       1       0      76       0       0       0       0
ce01-lcg.cr.cnaf.infn.it           0       0       0       0       0     100       0       0       0       0
ce01-lcg.projects.cscs.ch          0       0       0       0       0     100       0       0       0       0
ce03-lcg.cr.cnaf.infn.it           0       0       0       0       0     100       0       0       0       0
ce04.pic.es                        0       0       0       0       0     100       0       0       0       0
ce106.cern.ch                      0       0       0       0       0     100       0       0       0       0
ceitep.itep.ru                     0       0       0       0       0     100       0       0       0       0
cmslcgce.fnal.gov                  0       0       0       0       0     100       0       0       0       0
cmsrm-ce01.roma1.infn.it           0       0       0       0       0     100       0       0       0       0
dgc-grid-40.brunel.ac.uk           0       0       0       0       0     100       0       0       0       0
egeece.ifca.org.es                80      20       0       0       0       0       0       0       0       0
grid-ce0.desy.de                   0       0       0       0       0     100       0       0       0       0
grid-ce1.desy.de                   0       0       0       0       0     100       0       0       0       0
grid-ce2.desy.de                   0       0       0       0       0     100       0       0       0       0
grid10.lal.in2p3.fr                0       0       0       0       0     100       0       0       0       0
grid109.kfki.hu                    0       0       0       0       0     100       0       0       0       0
gridba2.ba.infn.it                 0       0       0       0       0     100       0       0       0       0
gridce.iihe.ac.be                  0       0       0       0       0      97       3       0       0       0
gw39.hep.ph.ic.ac.uk               0       0       0      49       0       3       0      48       0       0
lcg00125.grid.sinica.edu.tw        0       0       0       9       9      73       0       9       0       0
lcg06.sinp.msu.ru                  0       0       0     100       0       0       0       0       0       0
oberon.hep.kbfi.ee                 0       0       0     100       0       0       0       0       0       0
polgrid1.in2p3.fr                  0       0       0       0       0     100       0       0       0       0
t2-ce-02.lnl.infn.it               0       0       0       0       0     100       0       0       0       0
</verbatim>

---++++ Comments

The Submitted jobs at cclcgceli02.in2p3.fr have in fact finished, but in the logging info the last event is a !RegJob whose timestamp, however, is close to those of the other !RegJob events. In addition, for those jobs glite-job-status -v 3 reports timestamps only for the Submitted and Waiting states. This is linked to the fact that the sequence code of the logged event is wrong: the last !RegJob event had the sequence code UI=000000:NS=0000000001:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000, while the first event from the WM/BH had UI=000000:NS=0000000000:WM=000000:BH=0000000001:JSS=000000:LM=000000:LRMS=000000:APP=000000 instead of UI=000000:NS=0000000001:WM=000000:BH=0000000001:JSS=000000:LM=000000:LRMS=000000:APP=000000, which causes all subsequent events to be considered prior to the last !RegJob. The reason for this behaviour is not yet understood.

The Aborted jobs at gw39.hep.ph.ic.ac.uk and lcg00125.grid.sinica.edu.tw had the "unspecified gridmanager error". The 3 failed jobs at gridce.iihe.ac.be had the "Got a job held event, reason: Globus error 124: old job manager is still alive" error. The jobs at egeece.ifca.org.es are either Submitted or Waiting because no CE can be matched; there are 20 Waiting jobs, which is strange because the maximum number of planners per DAG is 10.
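As a side note, a per-CE summary table like the one above can be compiled from the status of the individual jobs. The following is a rough, hypothetical sketch of such a tally on the UI; the job_ids.txt input file and the exact "Current Status:" / "Destination:" labels in the glite-job-status output are assumptions:

<verbatim>
#!/bin/sh
# Hypothetical sketch: count job states per destination CE.
# job_ids.txt is assumed to contain one gLite job identifier per line.
for jobid in $(cat job_ids.txt); do
  glite-job-status "$jobid" | awk -F': *' '
    /^Current Status/ { status = $2 }   # e.g. "Done (Success)"
    /^Destination/    { ce = $2 }       # CE host name (empty if the job was never matched)
    END               { print ce "\t" status }'
done | sort | uniq -c
</verbatim>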
---+++ 1 November, 2006

   * VOMS proxy duration: 48 hours
   * Application: CMSSW_0_6_1
   * WMS host: lxb7283.cern.ch
   * Flavour: gLite 3.1
   * RAM memory: 4 GB
   * LB server: lxb7026.cern.ch
   * Number of submitted jobs: 21000
   * Number of jobs/collection: 200
   * Number of collections actually submitted: 105
   * Number of CEs: 21
   * Submission start time: 11/01/06, 12:03
   * Submission end time: 11/02/06, 11:27
   * Maximum number of planners/DAG: 10

---++++ Summary table

<verbatim>
Site                          Submit    Wait   Ready   Sched     Run Done(S) Done(F)     Abo   Clear    Canc
ce01-lcg.cr.cnaf.infn.it          20       0       0       0       0     980       0       0       0       0
ce03-lcg.cr.cnaf.infn.it         107       0       0       0       0     893       0       0       0       0
ce04.pic.es                       37       0       0       0       2     961       0       0       0       0
ce101.cern.ch                     40       1       0       0       1     934       0      24       0       0
ce102.cern.ch                    215      12       0       0       0       0       0     773       0       0
ce105.cern.ch                    178      13       0       0       0       0       0     809       0       0
ce106.cern.ch                    236       0       0       0       0     764       0       0       0       0
ce107.cern.ch                    288      25       0       0       0     117       0     570       0       0
ceitep.itep.ru                   110       0       0       0       0     890       0       0       0       0
cmslcgce.fnal.gov                 49       0       0       0       0     951       0       0       0       0
cmsrm-ce01.roma1.infn.it         259       0       0       0       0     741       0       0       0       0
dgc-grid-40.brunel.ac.uk          26       0       0       0       0     974       0       0       0       0
grid-ce0.desy.de                 228       1       0       0       0     771       0       0       0       0
grid10.lal.in2p3.fr               69       0       0       0       0     931       0       0       0       0
grid109.kfki.hu                   50       0       0       0       0     950       0       0       0       0
gridce.iihe.ac.be                244       0       0       0       0     745      11       0       0       0
gw39.hep.ph.ic.ac.uk               0       0       0       0       0     442     168     383       0       7
lcg00125.grid.sinica.edu.tw        7      57       0       0       0     773      17     146       0       0
lcg02.ciemat.es                   16       0       0       0       0     980       0       4       0       0
oberon.hep.kbfi.ee               206       0       0       0       0     392     359      43       0       0
t2-ce-02.lnl.infn.it             375       0       0       0       0     625       0       0       0       0
</verbatim>

---++++ Comments by CE

---+++++ ce01-lcg.cr.cnaf.infn.it

20 jobs are apparently Submitted but actually finished, with a !RegJob event at the end of the logging info.

---+++++ ce03-lcg.cr.cnaf.infn.it

38 jobs are stuck in the Submitted status (only 3 !RegJob events in the logging info), with the error message "cannot create LB context". For the other Submitted jobs, see above.

---+++++ ce04.pic.es

37 jobs are apparently Submitted.

---+++++ ce101.cern.ch

24 jobs are Submitted with reason (!) "no matching resources found"; they have a Pending event before the third !RegJob event. The other 16 Submitted jobs have no reason, and the third !RegJob comes before the first Pending. 13 jobs were Aborted with "X509 proxy expired" because no CE could be matched before the proxy expired. 11 jobs were Aborted with "request expired" because no CE could be matched within 24 hours. 1 job is Running: no termination events were received. 1 job is Waiting: no Abort event was logged.

---+++++ ce102.cern.ch

-- Main.AndreaSciaba - 30 Oct 2006
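For completeness, a 48-hour VOMS proxy such as the one used for this test can be created on the UI with a command along the following lines (a sketch; the validity argument is in hours:minutes):

<verbatim>
# Create a VOMS proxy for the cms VO valid for 48 hours, so that it
# outlives the 24-hour window in which the WM keeps trying to match jobs.
voms-proxy-init -voms cms -valid 48:00
</verbatim>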