(2013-02-22, AndreaSciaba)
---+!! Week of 130218

%TOC%

---++ Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
   1. Dial +41227676000 (Main) and enter access code 0119168, or
   2. To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]
   3. The scod rota for the next few weeks is at ScodRota

---++ WLCG Availability, Service Incidents, Broadcasts, Operations Web

| *VO Summaries of Site Usability* ||||*SIRs* |*Broadcasts* |*Operations Web* |
| [[http://dashb-alice-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ALICE_CRITICAL&group=all%2Bsites&site%5B%5D=CCIN2P3&site%5B%5D=CERN&site%5B%5D=CNAF&site%5B%5D=FZK&site%5B%5D=NIKHEF&site%5B%5D=RAL&site%5B%5D=SARA&type=quality][ALICE]] | [[http://dashb-atlas-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ATLAS_CRITICAL&group=All%2Bsites&site%5B%5D=BNL-ATLAS&site%5B%5D=CERN-PROD&site%5B%5D=FZK-LCG2&site%5B%5D=IN2P3-CC&site%5B%5D=INFN-T1&site%5B%5D=NDGF-T1&site%5B%5D=NIKHEF-ELPROD&site%5B%5D=pic&site%5B%5D=RAL-LCG2&site%5B%5D=SARA-MATRIX&site%5B%5D=Taiwan-LCG2&site%5B%5D=TRIUMF-LCG2&type=quality][ATLAS]] | [[http://dashb-cms-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=CMS_CRITICAL_FULL&group=Tier1s%2B%252B%2BTier0&site%5B%5D=T0_CH_CERN&site%5B%5D=T1_CH_CERN&site%5B%5D=T1_DE_KIT&site%5B%5D=T1_ES_PIC&site%5B%5D=T1_FR_CCIN2P3&site%5B%5D=T1_IT_CNAF&site%5B%5D=T1_TW_ASGC&site%5B%5D=T1_UK_RAL&site%5B%5D=T1_US_FNAL&type=quality][CMS]] | [[http://dashb-lhcb-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=LHCb_CRITICAL&group=Tier%2B0/1&site%5B%5D=LCG.CERN.ch&site%5B%5D=LCG.CNAF.it&site%5B%5D=LCG.GRIDKA.de&site%5B%5D=LCG.IN2P3.fr&site%5B%5D=LCG.NIKHEF.nl&site%5B%5D=LCG.PIC.es&site%5B%5D=LCG.RAL.uk&site%5B%5D=LCG.SARA.nl&type=quality][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://operations-portal.egi.eu/broadcast/archive][Broadcast archive]] | [[WLCGOperationsWeb][Operations Web]] |

---++ General Information

| *General Information* ||| *GGUS Information* | *LHC Machine Information* |
| [[http://itssb.web.cern.ch/][CERN IT status board]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[http://cern.ch/planet-wlcg][WLCG Blogs]] | GgusInformation | [[https://espace.cern.ch/be-dep-op-lhc-machine-operation/default.aspx][Sharepoint site]] - [[http://op-webtools.web.cern.ch/op-webtools/vistar/vistars.php?usr=LHC1][LHC Page 1]] |

<HR>

---++ Monday

Attendance: local(Alex B, Belinda, Maarten, Stefan, Steve);remote(Boris, Elizabeth, Gareth, Lisa, Michael, Onno, Rolf, Torre, Wei-Jen).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * T0/Central services
         * NTR
      * Tier1s
         * Sat pm: transfer failures to Taiwan. Attributed by site to busy disk servers, OK again and ticket closed Sun night. GGUS:91581
         * Sun pm: Source errors in transfers from TRIUMF-LCG2 and other CA sites. FTS cannot contact non-CA FTS servers. Site is working on it. GGUS:91588
      * Tier 2 calibration centers
         * Sun am: ES CALIBDISK failures of Functional test transfers, SRM down at IFIC, all file transfers failing. Failure in one RAID group, now offline; restoring Lustre and SRM. GGUS:91586
      * FYI: ATLAS AMOD(s) for this and next week not yet identified.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * no report
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing, some prompt-processing, MC and user jobs.
      * T0: NTR
      * T1:
         * IN2P3: Problem with SE at IN2P3 (GGUS:91557)
         * IN2P3: NAGIOS problem still ongoing at IN2P3 (GGUS:91126).

Sites / Services round table:

   * ASGC
      * ATLAS and CMS jobs affected by CVMFS 2.0.19 cache filling up due to a known bug; for now mitigated by manual cleanups; should be fixed in the upcoming release (2.1.7), expected in a few days
   * BNL - ntr
   * FNAL - ntr
   * IN2P3 - ntr
   * NDGF - ntr
   * NLT1
      * during the weekend 1 dCache pool node was stuck; it was restarted last night
   * OSG - ntr
   * RAL
      * some ongoing issues with the batch system not starting enough jobs, being investigated
   * dashboards - ntr
   * GGUS
      * *NB!!! The Italian Tier1 needs to update the host certificate for their ticketing system (ticketing.cnaf.infn.it). The change will be on Wednesday 2013/02/20 around 9:30am CET.* A short interruption may be noticed in the interface with GGUS as the server needs to be rebooted. Details in Savannah:135912
   * grid services - ntr
   * storage
      * during the weekend EOS-LHCb was unstable; after SW updates earlier today its behavior looks smoother in the monitoring

AOB:

---++ Tuesday

Attendance: local(Alex B, Eva, Maarten, Maria D, Stefan, Steve);remote(Boris, Jeremy, Lisa, Matteo, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Tiju, Wei-Jen, Xavier).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * NTR. Most probably no one from ATLAS can connect today. Sorry.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * no report
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
      * T0: NTR
      * T1:
         * IN2P3: NAGIOS problem still ongoing at IN2P3 (GGUS:91126).
            * Rolf: the ticket is with the SAM team now; we are not aware of changes that might explain why the test works only sometimes
            * Stefan: the test is failing randomly, the cause is not yet known

Sites / Services round table:

   * ASGC - ntr
   * BNL - ntr
   * CNAF - ntr
   * FNAL - ntr
   * GridPP
      * the multi-core ATLAS jobs mentioned in the KIT report are PROOF-Lite jobs
   * IN2P3
      * on March 19 there will probably be an all-day outage for electrical work, details to follow later
   * KIT
      * our Frontier Squid servers will be updated between 9-10 UTC tomorrow and the day after, should be transparent
      * single-core queues have been misused by ATLAS users submitting multi-core jobs, ATLAS are following up
   * NDGF
      * we have observed transfer errors due to a network problem, being investigated
   * NLT1 - ntr
   * OSG - ntr
   * PIC
      * complete downtime on March 26 between 5-19 UTC for electrical maintenance
   * RAL - ntr
   * dashboards - ntr
   * databases - ntr
   * GGUS/SNOW
      * experiments will no longer be prompted to inform Maria of important tickets that are not making progress, as such tickets can just be included in the experiment reports of the bi-weekly Operations Coordination meeting
   * grid services - ntr

AOB:

---++ Wednesday

Attendance: local(Alexei, Belinda, David, Dirk, Luca C, Maarten, Maria D, Massimo, Stefan, Steve);remote(Boris, John, Kyle, Lisa, Matteo, Michael, Pavel, Pepe, Rolf, Ron, Wei-Jen).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * Central services
         * GGUS: Problem in opening a TEAM ticket to a specific site: The site parameters haven't been synchronized correctly with GOC DB (GGUS:91634)
         * GGUS: A shifter could not open tickets due to an issue with CRL (GGUS:91610)
         * SLS for ATLAS !HammerCloud is out of date (last update: 8 Feb 2013) https://sls.cern.ch/sls/service.php?id=HC.ATLAS
            * On Feb 12: "the web server that serves the SLS reports is decommissioned and I'm moving the thing to the new one."
      * T1s and network
         * FZK-LCG2: There are long-standing GGUS tickets for problems in transfers between FZK-LCG2 and UK sites (GGUS:87958, GGUS:91439)
         * RRC-KI-T1: ATLAS has started integrating the RU-T1 (RRC-KI-T1) in ATLAS systems. FTS3 servers at RAL and CERN were used for test file transfers.
            * Alexei: the prototype T1 will be used in a reprocessing exercise
            * Alexei: next week a small reprocessing campaign will run at the T1 sites
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * no report
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
      * T0: NTR
         * Migration CASTOR -> EOS progressing, estimated to last for another 6 weeks
      * T1:
         * IN2P3: NAGIOS problem still ongoing at IN2P3 (GGUS:91126); logfiles of failed SAM probes seem to indicate that the probe is killed by the batch system (logs uploaded to GGUS ticket)
            * Rolf: we have also involved our batch system experts

Sites / Services round table:

   * ASGC - ntr
   * BNL - ntr
   * CNAF
      * CREAM ce01-lcg will be down Feb 21-28 for upgrade to EMI-2 on SL6
   * FNAL - ntr
   * IN2P3 - nta
   * KIT
      * 3 Frontier Squid servers were upgraded OK today, the remaining 3 will be done tomorrow
   * NDGF
      * the transfer errors reported yesterday are still being investigated
      * tomorrow there will be a short downtime of the SRM head node for security patching; it might even cure the transfer errors
   * NLT1 - ntr
   * OSG - ntr
   * PIC - ntr
   * RAL - ntr
   * dashboards
      * this morning the ATLAS job monitoring dashboard was affected by a DB problem, resulting in the job history having a few small gaps
   * databases
      * this morning one dashboard application was affected by a change in an Oracle query execution plan, fixed
   * GGUS/SNOW
      * The host certificate update for the Italian ticketing system (ticketing.cnaf.infn.it), announced last Monday, took place this morning and was successful.
      * Next GGUS Release will be in a week, on 2013/02/27.
      * An interface between GGUS and the ibergrid RT ticketing system will enter production with next week's GGUS release. The change affects PIC. In case of any problem, please open a GGUS ticket against GGUS or comment in Savannah:130314 .
   * grid services
      * there was a problem with the batch system dispatching jobs this morning, fixed
   * storage - ntr

AOB:

---++ Thursday

Attendance: local(Alessandro, Alex B, Belinda, Luca M, Maarten, Stefan, Steve, Ueda);remote(Boris, Gareth, Lisa, Marian, Michael, Pepe, Rob, Rolf, Rolf, Ronald, Saverio, Wei-Jen).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * Central services
         * GGUS: Problem in opening a TEAM ticket to a specific site: GGUS:91634 verified: the site parameters haven't been synchronized correctly with GOC DB/GGUS.
         * GGUS: A shifter could not open tickets due to an issue with CRL. GGUS:91610 in progress: the shifter was advised to temporarily use the account which is mapped to the certificate.
         * SLS for ATLAS !HammerCloud unavailable (in grey). Fixed, migration to the new hardware completed.
      * T1s and network
         * File transfer problems to FZK-LCG2 from UK sites: 1340 failures "GRIDFTP_ER.:server err.500" from UKI-SCOTGRID-GLASGOW and 30 failures from UKI-NORTHGRID-LIV-HEP. GGUS:87958 in progress, updated.
            * Marian: we are also looking into PerfSONAR measurements
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * no report
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NIKHEF: issues with renewal of VOBOX host cert, fixed (GGUS:91674)
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
      * T0:
         * ALARM ticket (GGUS:91690) for an AFS-hosted web service which is not responding. It serves grid jobs for configuration and setup purposes
            * Stefan: it was due to an accidental !DoS by one machine
         * Many failures in CASTOR->EOS migration because of different checksums in LFC and CASTOR
            * Luca: working on it, please send the list of affected files
            * Stefan: OK, ~300 files from 2008, also on tape; the issue is due to the presence or absence of leading zeroes in the checksum
      * T1: NTR

Sites / Services round table:

   * ASGC
      * downtime Feb 25 23:00 to Feb 26 18:00 UTC for upgrades of CASTOR, DPM and storage firmware
   * BNL - ntr
   * CNAF - ntr
   * FNAL - ntr
   * IN2P3 - ntr
   * KIT - nta
   * NDGF
      * the transfer errors were due to a network problem, things look better now
      * today's SRM maintenance went OK
   * NLT1
      * this morning SARA had an unscheduled outage: dCache was unavailable due to a loose fiber
   * OSG - ntr
   * PIC - ntr
   * RAL - ntr
   * dashboards - ntr
   * grid services - ntr
   * storage
      * CASTOR DB NAS will have a HW intervention 17:30-21:30 CET, should be transparent

AOB:

---++ Friday

Attendance: local(!AndreaS, Kate, Mike, Steve, Belinda, Stefan);remote(Xavier/KIT, Gareth/RAL, Wei-Jen/ASGC, Onno/NL-T1, Michael/BNL, Matteo/CNAF, Lisa/FNAL, Rolf/IN2P3, Rob/OSG, Pepe/PIC).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * Central services
         * GGUS: A shifter could not open tickets due to an issue with CRL. GGUS:91610 solved: the issue with the certificate was fixed, the shifter is able to open/update GGUS tickets now.
         * CERN/VOMS problems affecting ATLAS production and analysis jobs (GGUS:91704, GGUS:91706, GGUS:91710). Thanks to Maarten for the quick action and the ALARM (GGUS:91706).
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * CERN: VOMS incident (see below), alarm ticket GGUS:91706 opened yesterday evening ~20:00 [Steve: something similar happened months ago]
      * CERN: EOS lost 17 files, 12 were dark data
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
      * T0:
         * ALARM ticket (GGUS:91690) for an AFS-hosted web service which was not responding, understood and fixed
      * T1: NTR

Sites / Services round table:

   * ASGC: had a CASTOR crash this morning
   * BNL: ntr
   * CNAF: ntr
   * FNAL: ntr
   * !IN2P3: ntr
   * KIT: ntr
   * NL-T1:
      * one dCache pool was stuck for six hours last night and was restarted this morning. This has happened a few times already; we hope that a kernel upgrade will fix the problem.
      * This Monday and Tuesday, SURF-SARA will be in maintenance
   * PIC: ntr
   * !RAL: on Tuesday morning we declared a downtime "at risk" to reboot a network switch. The effect should be minimal.
   * OSG: ntr
   * CERN batch and grid services: !VOMS incident, wrong host certificate put in place on voms.cern.ch, see PESgroup.IncidentVOMSFeb2013. Service broken at 16:10 on Thursday, restored at 07:00 on Friday morning.
   * CERN storage services: ntr
   * Dashboards: ntr
   * Databases: ntr

AOB:
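
Note on the checksum mismatch from Thursday's LHCb report (leading zeroes present in one catalogue and absent in the other): a comparison that is insensitive to zero-padding could look roughly like the sketch below. This is only an illustration, not the fix applied by the CASTOR or EOS teams. It assumes the checksums are hexadecimal strings of at most 8 digits (as for a 32-bit checksum such as Adler-32, which the minutes do not state); the function names =normalize_checksum= and =checksums_match= are hypothetical.

```python
def normalize_checksum(value: str, width: int = 8) -> str:
    """Return a lower-case hex checksum, left-padded with zeroes to `width` digits.

    Assumption: checksums are hex strings of at most `width` digits
    (8 digits would correspond to a 32-bit checksum).
    """
    digits = value.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if not digits or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a hexadecimal checksum: {value!r}")
    if len(digits) > width:
        raise ValueError(f"checksum longer than {width} hex digits: {value!r}")
    return digits.zfill(width)


def checksums_match(a: str, b: str, width: int = 8) -> bool:
    """Compare two checksums, ignoring case and missing leading zeroes."""
    return normalize_checksum(a, width) == normalize_checksum(b, width)


if __name__ == "__main__":
    # The same value stored with and without its leading zero compares equal.
    print(checksums_match("0a1b2c3d", "A1B2C3D"))
    # Genuinely different values still differ.
    print(checksums_match("0a1b2c3d", "0a1b2c3e"))
```

Padding both values to a fixed width before comparing keeps the check strict for real differences while tolerating the formatting difference described in the report.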
Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| ggus-data.ppt | r1 | 2680.0 K | 2013-02-18 - 11:52 | MariaDimou | Final ALARM drills for the 2013/02/19 WLCG MB. |
Topic revision: r19 - 2013-02-22 - AndreaSciaba