WLCGDailyMeetingsWeek140526
(2014-05-30,
MaartenLitmaath
)
---+!! Week of 140526

%TOC%

---++ WLCG Operations Call details

   * At CERN the meeting room is [[https://maps.cern.ch/mapsearch/?centerX=2492565&centerY=1121070&centerScale=2500][513]] R-068.
   * For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
      1 Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
      1 To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]
   * In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found [[https://indico.cern.ch/conferenceDisplay.py?confId=287280][here]]. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

---++ General Information

   * The SCOD rota for the next few weeks is at ScodRota
   * General information about the WLCG Service can be accessed from the [[WLCGOperationsWeb][Operations Web]]

---++ Monday

Attendance:
   * local: Ben (CERN Grid Services), Maria A (SCOD), Maarten (ALICE), Felix (ASGC), Xavi (Storage), Maria D (GGUS), Pablo (Grid Monitoring + GGUS)
   * remote: Sang-Un (KISTI), Roger (NDGF), Stefano (CMS), Salvatore (CNAF), Rolf (IN2P3), Onno (NL-T1), Kai (ATLAS)

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsDailyReports2014][reports]] ([[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsDailyReports2014?raw=on][raw view]]) -
      * Central Services
      * T0/T1s
         * BNL short DDM problem on Friday due to dCache headnode failure (solved: GGUS:105640)

   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * major issue with Argus identified last Friday by Middleware Support and the Rome admin while following up on GGUS:105545 (see last Thursday's report), likely the reason for a large part of our troubles with glexec: we are particularly concerned about glexec failures at the end of the user job. The pilot cannot regain the user identity to transfer output and clean up, so the job fails, wasting all the used resources. A glexec failure at user payload start is less severe and we handle it by wait and retry.
      * story is in GGUS:105597 . Problem description [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=105597#update#8][here]]: =PEP daemon may be throttled by outgoing OCSP requests (!VOMS and CAnL certificate validators appear to be serialized, so single rogue OCSP responder can bring PEP daemon to halt :-()=
      * in English: one CA failing to reply quickly to a certificate validation request leads to the process blocking, and all authz requests pending at that time ultimately fail. Sites are affected at random, depending on whether they get jobs from users with certificates of the non-responding CA
      * underlying technical issue GGUS:105666 Argus PEP incorrectly serializes certificate validation
      * Need robust fix asap. What's the mechanism to prioritize and follow up on middleware issues affecting WLCG nowadays?

   * ALICE -
      * NTR

   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -
      * MC simulation and User jobs.
      * T0: NTR
      * T1: NTR

During the meeting the ARGUS issue reported by CMS is explained by Maarten, who is part of the ARGUS GGUS SU and is aware of the problem. The main problem is in the CANL component, where OCSP support is enabled by default. This will need to be disabled to fix the issue. Stefano asks how the priority could be raised to make sure the fix is provided asap. Maarten replies that this is already taken care of, as this affects many sites and experiments, not only CMS. !MariaA asks whether the developers have provided a date to release a fix. Maarten says that not all the developers concerned have replied yet.
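
As an illustration of the blocking behaviour described in GGUS:105597, the following minimal Python sketch simulates a queue of authorization requests whose certificates need a revocation lookup. It is not Argus or CAnL code: the responder names, the latencies and all function names are hypothetical, and the lookup is a simple sleep. It only shows why a single slow responder delays every request when validation is serialized, whereas independent validation confines the delay to requests from that CA. The remedy discussed in the meeting (disabling OCSP in CANL) is a different measure and is not modelled here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical responder latencies in seconds (illustrative, not measured values).
RESPONDER_DELAY = {"fast-ca": 0.01, "rogue-ca": 0.4}


def check_revocation(ca):
    """Stand-in for an OCSP lookup: blocks for the responder's latency."""
    time.sleep(RESPONDER_DELAY[ca])
    return "good"


def validate_serially(cas):
    """Validate one request at a time; return seconds until each one finished."""
    start = time.monotonic()
    finished = []
    for ca in cas:
        check_revocation(ca)
        finished.append(time.monotonic() - start)
    return finished


def validate_concurrently(cas):
    """Validate every request in its own worker thread."""
    start = time.monotonic()

    def timed(ca):
        check_revocation(ca)
        return time.monotonic() - start

    with ThreadPoolExecutor(max_workers=len(cas)) as pool:
        return list(pool.map(timed, cas))


if __name__ == "__main__":
    # One request from the slow CA arrives first, followed by two ordinary ones.
    queue = ["rogue-ca", "fast-ca", "fast-ca"]
    print("serialized:", [round(t, 2) for t in validate_serially(queue)])
    print("concurrent:", [round(t, 2) for t in validate_concurrently(queue)])
```

In the serialized run every request waits behind the slow lookup, which mirrors the random failures seen at sites that happen to receive jobs from users of the non-responding CA.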
!MariaA also asks how WLCG could inform about this known issue, so that other sites and experiments do not spend time debugging and understanding it. It seems ARGUS has suffered from other instabilities recently, so it is not always clear whether ARGUS problems are due to this or not. To be followed up offline.

Sites / Services round table:
   * ASGC: NTR
   * BNL: Not present
   * CNAF: NTR
   * FNAL: Not present
   * IN2P3: NTR
   * JINR: Not present
   * KISTI: NTR
   * KIT: Not present
   * NDGF: NTR
   * NL-T1: NTR
   * PIC: Not present
   * RAL: Not present
   * RRC-KI: Not present
   * TRIUMF: Not present

Central Services:
   * GGUS: Release done this morning at 06:38 UTC. In this release: new NGI_China, decommissioning of the CMS Savannah bridge, and several minor improvements in CMS forms. See [[https://ggus.eu/?mode=release_notes][release notes]] for more details
      * Alarm for [[https://ggus.eu/?mode=ticket_info&ticket_id=105725][NGI_IT]] still open
      * Alarms for UK and US will be done tomorrow
   * CERN Grid services: Ben informs about a new WMS upgrade and the ramping down of SL5 capacity. It is planned to start draining SL5 queues on the 19th of June. Maarten asks whether this means that SL5 capacity will no longer be available as of that date. Ben clarifies that this only affects job submission through the old SLC5 CEs (ce201 ... ce207). To be followed up offline.
   * Storage: Xavi reports about a Castor CMS upgrade that took place during the day and took a bit longer than expected due to an unrelated problem with Castor-Public that needed urgent investigation. The delay had no impact and was properly announced on the SSB. ATLAS EOS will be updated on 27.05.2014 from 10:00 to 10:30.

AOB:
   * <big> Next meeting on %RED% *Friday* %BLACK% </big>

---++ Thursday: Ascension holiday

   * <big> The meeting will be held on %RED% *Friday* %BLACK% instead. </big>

---++ Friday

Attendance:
   * local: Andrea M (MW Officer), Ben (CERN batch and grid services), Felix (ASGC), Maarten (SCOD), Pablo (grid monitoring + GGUS)
   * remote: Alexei (ATLAS), Antonio (CNAF), John (!RAL), Ken (CMS), Michael (BNL), Roger (NDGF), Rolf (!IN2P3), Vladimir (LHCb)

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsDailyReports2014][reports]] ([[https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ADCOperationsDailyReports2014?raw=on][raw view]]) -
      * Central Services
      * T0/T1s
         * FZK transfer failures GGUS:105803 , to be followed up at 3pm meeting.
         * TRIUMF tape staging problems GGUS:105886 .

   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * It has been very quiet out there; there are scattered reports of sites that are running out of jobs, but we are working with the sites to understand why and to get things moving.
      * With the demise of the Savannah-GGUS bridge, we're busy learning more about GGUS. I for one started a ticket drill on the OSG; on Tuesday, I sent test tickets to all the OSG T2 sites, just asking them to report who received notification of the ticket and to figure out how to get the ticket closed. Only half the tickets have been closed so far. We'll keep working with OSG and the sites on this.

   * ALICE -
      * CERN
         * SLC6 !CREAM CEs often reporting wrong job numbers in the BDII (GGUS:105855)
            * started Wed evening
            * ALICE needed to babysit VOBOXes to avoid overloading !LSF with job submissions
            * looking OK again since yesterday mid evening
      * KIT
         * after the maintenance downtime a network configuration issue prevented full use of the SE
            * fixed Wed evening, thanks!
         * high network load due to usage of old SW versions by various ALICE users
            * they have been asked to switch to newer versions ASAP
            * we look further into preventing easy access to old versions
               * since Mon the vast majority are no longer available, but some were kept
            * the jobs cap has been lowered to mitigate the issue
      * NDGF
         * job failures due to many files not found in dCache; being debugged

   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -
      * MC simulation and User jobs.
      * T0: NTR
      * T1:
         * GRIDKA:
            * Pilots aborted (GGUS:105867)
            * Files not available (GGUS:105875)
            * Pilots submitted through WMS Cleared (GGUS:105885)

Sites / Services round table:
   * ASGC: ntr
   * BNL: ntr
   * CNAF: ntr
   * FNAL:
   * !GridPP:
   * !IN2P3: ntr
   * JINR:
   * KISTI:
   * KIT:
   * NDGF:
      * scheduled downtime next Mon for tape server upgrade; some ALICE or ATLAS data might be temporarily unavailable
   * NL-T1:
   * OSG:
   * PIC:
   * !RAL: ntr
   * RRC-KI:
   * TRIUMF:
   * CERN batch and grid services:
      * the ATLAS LFC daemons have been switched off and the service taken out of SLS monitoring
   * CERN storage services:
   * Databases:
   * GGUS:
      * One of the [[https://ggus.eu/?mode=ticket_info&ticket_id=105725][test alarms]] done during the release took 14 hours to be acknowledged
         * CNAF will look into what went wrong there
   * Grid Monitoring: ntr
   * MW Officer:
      * the fix for the DPM 1.8.8 bug affecting CMS T2 FTS-2 transfers has been released in EPEL

AOB:
Topic revision: r12 - 2014-05-30
-
MaartenLitmaath