---++ *Handling of WMS server Errors and Timeouts in DIRAC _TaskQueue_ Director*

In order to successfully fill the LCG computing resources with pilots for LHCb, with current peaks of over 25 K jobs per day, the DIRAC _TaskQueue_ Director Agent needs to sustain a submission rate of at least twice this number, to allow for site and WMS inefficiencies as well as to leave some margin for increasing rates. This means a pilot submission rate of over 1000 pilots per hour. <br />
For efficient operation, this rate must be achieved and maintained without the need of human intervention to manually include and exclude servers. <br />
To do so, the Director submits (preceded by a list-match where required, i.e. not for SAM jobs, with a configurable caching time [currently 15 minutes]) in parallel (Python) threads to a randomly selected WMS server out of a configurable list [currently 11 servers at all our Tier1 sites]. <br />
All configurable parameters are re-read on every iteration of the Director (every one to a few minutes, depending on the execution time of the threads), so they can be updated in real time. <br />
Quite often we have observed that, due to bugs in the server and client gLite WMS implementations, these commands (glite-wms-job-submit / glite-wms-job-list-match) fail or, even worse, take an unreasonably long time to complete, in some cases blocking the execution thread completely. In our experience, a limited number of these errors are due to configuration issues on the server side, while the vast majority are due to various overloads on the server side that result in unpredictable behaviours, return codes and return messages. <br />
The Director sets a maximum execution time of 120 seconds for any of these commands and kills the execution if this limit is reached, retrieving all the information written to _stdout_ and _stderr_ up to that moment.
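The kill-after-120-seconds behaviour can be sketched with Python's =subprocess= module. This is a minimal illustration, not the actual Director code: the helper names, the example server list and the JDL path are all hypothetical, and the real implementation runs inside the Director's own thread pool.

```python
import random
import subprocess

TIMEOUT = 120  # hard limit on command execution, as described above

def run_with_timeout(cmd, timeout=TIMEOUT):
    """Run a command, killing it after `timeout` seconds, and return
    whatever stdout/stderr were collected up to that moment."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return {"rc": proc.returncode, "stdout": proc.stdout,
                "stderr": proc.stderr, "timedOut": False}
    except subprocess.TimeoutExpired as exc:
        # On timeout the child is killed; the exception carries the
        # partial output (as bytes on POSIX, even in text mode).
        out = exc.stdout.decode() if isinstance(exc.stdout, bytes) else (exc.stdout or "")
        err = exc.stderr.decode() if isinstance(exc.stderr, bytes) else (exc.stderr or "")
        return {"rc": None, "stdout": out, "stderr": err, "timedOut": True}

def submit_pilot(jdl_path, servers):
    """Submit one pilot through a randomly selected WMS server
    (illustrative; assumes glite-wms-job-submit is on the PATH)."""
    server = random.choice(servers)
    result = run_with_timeout(["glite-wms-job-submit", "-a", "-e", server, jdl_path])
    result["server"] = server
    return result
```

On a timeout the caller still receives the partial _stdout_ / _stderr_, which is exactly what the Director needs to build the error report described below.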
Additionally, two caches with different configurable expiration times [currently 1 hour] are maintained ( _failingWMSCache_ and _ticketsWMSCache_ ); their usage is detailed below. <br />
When an error code is returned or a timeout is detected, the Director takes several configurable actions:
   * It tries to exclude the affected server from the list considered in the current iteration; since several threads are running, another thread might have found the problem shortly before this error, or might already have been directed to this server.
   * If the server is already in the _failingWMSCache_, the error is ignored.
   * The server is added to the _failingWMSCache_.
   * An error message is prepared containing the failing command, the type of failure (Timeout or command Error) and the full _stdout_ and _stderr_ collected, and a configurable _errorAddress_ is set as destination.
   * If the server is in the _ticketsWMSCache_ (meaning it already failed once and was added to both the _failingWMSCache_ and the _ticketsWMSCache_; the _failingWMSCache_ entry was cleared, but the server failed again before the _ticketsWMSCache_ entry expired), then the error message is converted into an alarm message by:
      * adding extra lines at the top of the message ("Submit GGUS Ticket for this error if not already opened. It has been failing at least for %s hours", properly filled in depending on the current configuration), and
      * changing the destination to a configurable _alarmAddress_.
   * If the server is not in the _ticketsWMSCache_, it is added with an extended expiration time with respect to the one used for the _failingWMSCache_.
   * Both error and alarm destinations are currently set to dirac.alarms at gmail.com (the account can be accessed via the usual LHCb password), but these are configurable values and can be disabled by setting them to an empty string in the configuration.
   * If the destination is valid, the Notification system is used to send the message to the requested destination.
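The two-cache escalation above can be sketched as follows. This is a simplified stand-in, not the DIRAC implementation: the =TTLCache= class, the function names and the 3-hour extended expiration are illustrative assumptions; only the 1-hour _failingWMSCache_ value and the GGUS alarm text come from the description above.

```python
import time

class TTLCache:
    """Minimal stand-in for the Director's expiring caches."""
    def __init__(self):
        self._data = {}
    def add(self, key, ttl):
        self._data[key] = time.time() + ttl
    def __contains__(self, key):
        expiry = self._data.get(key)
        if expiry is None:
            return False
        if time.time() > expiry:
            del self._data[key]
            return False
        return True

FAILING_TTL = 3600        # 1 hour, as currently configured
TICKETS_TTL = 3 * 3600    # extended expiration (illustrative value)

failingWMSCache = TTLCache()
ticketsWMSCache = TTLCache()

def handle_wms_error(server, command, failure, output, error_address, alarm_address):
    """Decide whether a WMS failure is ignored, reported as an error,
    or escalated to an alarm; return (destination, message) or None."""
    if server in failingWMSCache:
        return None                       # already reported recently: ignore
    failingWMSCache.add(server, FAILING_TTL)
    message = "%s failed on %s (%s)\n%s" % (command, server, failure, output)
    destination = error_address
    if server in ticketsWMSCache:
        # Second failure within the tickets window: escalate to an alarm.
        message = ("Submit GGUS Ticket for this error if not already opened\n"
                   "It has been failing at least for %s hours\n"
                   % (TICKETS_TTL // 3600)) + message
        destination = alarm_address
    else:
        ticketsWMSCache.add(server, TICKETS_TTL)
    if destination:                       # empty string disables notification
        return destination, message       # handed to the Notification system
    return None
```

The net effect is a rate limit of one notification per server per hour, with automatic promotion to an alarm when the same server keeps failing across _failingWMSCache_ windows.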
This configuration has allowed us, during the last week, to achieve the necessary submission rate, in excess of 1 K pilots per hour when necessary, without any need of manual intervention, despite numerous errors from the WMS servers that in the past would have meant an almost complete blockage of the pilot submission mechanism. <br /> ---++ <img alt="PilotsperHour.png" src="../../../pub/LHCb/WMSErrorHandling/PilotsperHour.png" /> <br /> -- Main.RicardoGraciani - 10 May 2009
---++ Topic attachments
| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| PilotsperHour.png | r1 | 66.4 K | 2009-05-10 - 06:59 | RicardoGraciani | Submitted pilots per WMS for last week. |