---+!! Week of 130218

%TOC%

---++ Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
   1. Dial +41227676000 (Main) and enter access code 0119168, or
   2. To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]
   3. The scod rota for the next few weeks is at ScodRota

---++ WLCG Availability, Service Incidents, Broadcasts, Operations Web

| *VO Summaries of Site Usability* ||||*SIRs* |*Broadcasts* |*Operations Web* |
| [[http://dashb-alice-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ALICE_CRITICAL&group=all%2Bsites&site%5B%5D=CCIN2P3&site%5B%5D=CERN&site%5B%5D=CNAF&site%5B%5D=FZK&site%5B%5D=NIKHEF&site%5B%5D=RAL&site%5B%5D=SARA&type=quality][ALICE]] | [[http://dashb-atlas-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ATLAS_CRITICAL&group=All%2Bsites&site%5B%5D=BNL-ATLAS&site%5B%5D=CERN-PROD&site%5B%5D=FZK-LCG2&site%5B%5D=IN2P3-CC&site%5B%5D=INFN-T1&site%5B%5D=NDGF-T1&site%5B%5D=NIKHEF-ELPROD&site%5B%5D=pic&site%5B%5D=RAL-LCG2&site%5B%5D=SARA-MATRIX&site%5B%5D=Taiwan-LCG2&site%5B%5D=TRIUMF-LCG2&type=quality][ATLAS]] | [[http://dashb-cms-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=CMS_CRITICAL_FULL&group=Tier1s%2B%252B%2BTier0&site%5B%5D=T0_CH_CERN&site%5B%5D=T1_CH_CERN&site%5B%5D=T1_DE_KIT&site%5B%5D=T1_ES_PIC&site%5B%5D=T1_FR_CCIN2P3&site%5B%5D=T1_IT_CNAF&site%5B%5D=T1_TW_ASGC&site%5B%5D=T1_UK_RAL&site%5B%5D=T1_US_FNAL&type=quality][CMS]] | [[http://dashb-lhcb-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=LHCb_CRITICAL&group=Tier%2B0/1&site%5B%5D=LCG.CERN.ch&site%5B%5D=LCG.CNAF.it&site%5B%5D=LCG.GRIDKA.de&site%5B%5D=LCG.IN2P3.fr&site%5B%5D=LCG.NIKHEF.nl&site%5B%5D=LCG.PIC.es&site%5B%5D=LCG.RAL.uk&site%5B%5D=LCG.SARA.nl&type=quality][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://operations-portal.egi.eu/broadcast/archive][Broadcast archive]] | [[WLCGOperationsWeb][Operations Web]] |

---++ General Information

| *General Information* ||| *GGUS Information* | *LHC Machine Information* |
| [[http://itssb.web.cern.ch/][CERN IT status board]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[http://cern.ch/planet-wlcg][WLCG Blogs]] | GgusInformation | [[https://espace.cern.ch/be-dep-op-lhc-machine-operation/default.aspx][Sharepoint site]] - [[http://op-webtools.web.cern.ch/op-webtools/vistar/vistars.php?usr=LHC1][LHC Page 1]] |

<HR>

---++ Monday

Attendance: local(Alex B, Belinda, Maarten, Stefan, Steve);remote(Boris, Elizabeth, Gareth, Lisa, Michael, Onno, Rolf, Torre, Wei-Jen, Xavier).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * T0/Central services
         * NTR
      * Tier1s
         * Sat pm: transfer failures to Taiwan. Attributed by site to busy disk servers, OK again and ticket closed Sun night. GGUS:91581
         * Sun pm: Source errors in transfers from TRIUMF-LCG2 and other CA sites. FTS cannot contact non-CA FTS servers. Site is working on it. GGUS:91588
      * Tier 2 calibration centers
         * Sun am: ES CALIBDISK failures of Functional test transfers, SRM down at IFIC, all file transfers failing. Failure in one RAID group, now offline. Restoring Lustre and SRM. GGUS:91586
      * FYI: ATLAS AMOD(s) for this and next week not yet identified.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * no report
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing, some prompt-processing, MC and user jobs.
      * T0: NTR
      * T1:
         * IN2P3: Problem with SE at IN2P3 (GGUS:91557)
         * IN2P3: NAGIOS problem still ongoing at IN2P3 (GGUS:91126).

Sites / Services round table:

   * ASGC
      * ATLAS and CMS jobs affected by CVMFS 2.0.19 cache filling up due to known bug; for now mitigated by manual cleanups; should be fixed in upcoming (2.1.7) release expected in a number of days
   * BNL - ntr
   * FNAL - ntr
   * IN2P3 - ntr
   * NDGF - ntr
   * NLT1
      * during the weekend there was 1 dCache pool node stuck, restarted yesterday night
   * OSG - ntr
   * RAL
      * some ongoing issues with the batch system not starting enough jobs, being investigated
   * dashboards - ntr
   * GGUS
      * *NB!!! The Italian Tier1 needs to update the host certificate for their ticketing system (ticketing.cnaf.infn.it). The change will be on Wednesday 2013/02/20 around 9:30am CET.* A short interruption may be perceived in the interface with GGUS as the server needs to be rebooted. Details in Savannah:135912
   * grid services - ntr
   * storage
      * during the weekend EOS-LHCb was unstable; after SW updates earlier today its behavior looks smoother in the monitoring

AOB:

---++ Tuesday

Attendance: local(Alex B, Eva, Maarten, Maria D, Stefan, Steve);remote(Boris, Jeremy, Lisa, Matteo, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Tiju, Wei-Jen, Xavier).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * NTR. Most probably no one from ATLAS can connect today. Sorry.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * no report
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
      * T0: NTR
      * T1:
         * IN2P3: NAGIOS problem still ongoing at IN2P3 (GGUS:91126).
            * Rolf: the ticket is with the SAM team now; we are not aware of changes that might explain why the test works only sometimes
            * Stefan: the test is failing randomly, the cause is not yet known

Sites / Services round table:

   * ASGC - ntr
   * BNL - ntr
   * CNAF - ntr
   * FNAL - ntr
   * GridPP
      * the multi-core ATLAS jobs mentioned in the KIT report are PROOF-Lite jobs
   * IN2P3
      * on March 19 there will probably be an all-day outage for electrical work, details to follow later
   * KIT
      * our Frontier Squid servers will be updated between 9-10 UTC tomorrow and the day after, should be transparent
      * single-core queues have been misused by ATLAS users submitting multi-core jobs, ATLAS are following up
   * NDGF
      * we have observed transfer errors due to a network problem, being investigated
   * NLT1 - ntr
   * OSG - ntr
   * PIC
      * complete downtime on March 26 between 5-19 UTC for electrical maintenance
   * RAL - ntr
   * dashboards - ntr
   * databases - ntr
   * GGUS/SNOW
      * experiments will no longer be prompted to inform Maria of important tickets that are not making progress, as such tickets can just be included in the experiment reports of the bi-weekly Operations Coordination meeting
   * grid services - ntr

AOB:

---++ Wednesday

Attendance: local(Alexei, Belinda, David, Dirk, Luca C, Maarten, Maria D, Massimo, Stefan, Steve);remote(Boris, John, Kyle, Lisa, Matteo, Michael, Pavel, Pepe, Rolf, Ron, Wei-Jen).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * Central services
         * GGUS: Problem in opening a TEAM ticket to a specific site: The site parameters haven't been synchronized correctly with GOC DB (GGUS:91634)
         * GGUS: A shifter could not open tickets with an issue with CRL (GGUS:91610)
         * SLS for ATLAS !HammerCloud has been outdated (last update: 8 Feb 2013) https://sls.cern.ch/sls/service.php?id=HC.ATLAS
            * On Feb 12: "the web server that serves the SLS reports is decommissioned and I'm moving the thing to the new one."
      * T1s and network
         * FZK-LCG2: There are long-standing GGUS tickets for problems in transfers between FZK-LCG2 and UK sites (GGUS:87958, GGUS:91439)
         * RRC-KI-T1: ATLAS has started integrating the RU-T1 (RRC-KI-T1) in ATLAS systems. FTS3 servers at RAL and CERN were used for test file transfers.
            * Alexei: the prototype T1 will be used in a reprocessing exercise
            * Alexei: next week a small reprocessing campaign will run at the T1 sites
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * no report
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
      * T0: NTR
         * Migration CASTOR -> EOS progressing, estimated to last for another 6 weeks
      * T1:
         * IN2P3: NAGIOS problem still ongoing at IN2P3 (GGUS:91126), logfiles of failed SAM probes seem to indicate that the probe is killed by the batch system (logs uploaded to GGUS ticket)
            * Rolf: we have also involved our batch system experts

Sites / Services round table:

   * ASGC - ntr
   * BNL - ntr
   * CNAF
      * CREAM ce01-lcg will be down Feb 21-28 for upgrade to EMI-2 on SL6
   * FNAL - ntr
   * IN2P3 - nta
   * KIT
      * 3 Frontier Squid servers were upgraded OK today, the remaining 3 will be done tomorrow
   * NDGF
      * the transfer errors reported yesterday are still being investigated
      * tomorrow there will be a short downtime of the SRM head node for security patching; it might even cure the transfer errors
   * NLT1 - ntr
   * OSG - ntr
   * PIC - ntr
   * RAL - ntr
   * dashboards
      * this morning the ATLAS job monitoring dashboard was affected by a DB problem, resulting in the job history having a few small gaps
   * databases
      * this morning one dashboard application was affected by a change in an Oracle query execution plan, fixed
   * GGUS/SNOW
      * The host certificate update for the Italian ticketing system (ticketing.cnaf.infn.it), announced last Monday, took place this morning and was successful.
      * Next GGUS Release will be in a week, on 2013/02/27.
      * An interface between GGUS and the ibergrid RT ticketing system will enter production with next week's GGUS release. The change affects PIC. In case of any problem, please open a GGUS ticket against GGUS or comment in Savannah:130314.
   * grid services
      * there was a problem with the batch system dispatching jobs this morning, fixed
   * storage - ntr

AOB:

---++ Thursday

Attendance: local(Alessandro, Alex B, Belinda, Luca M, Maarten, Stefan, Steve, Ueda);remote(Boris, Gareth, Lisa, Marian, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Wei-Jen).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * Central services
         * GGUS: Problem in opening a TEAM ticket to a specific site: GGUS:91634 verified: the site parameters haven't been synchronized correctly with GOC DB/GGUS.
         * GGUS: A shifter could not open tickets with an issue with CRL. GGUS:91610 in progress: the shifter was recommended to temporarily use the account which is mapped to the certificate.
         * SLS for ATLAS HammerCloud unavailable (in grey). Fixed, migration to the new hardware completed.
      * T1s and network
         * FZK-LCG2 from UK sites file transfer problems: 1340 failures "GRIDFTP_ER.:server err.500" from UKI-SCOTGRID-GLASGOW and 30 failures from UKI-NORTHGRID-LIV-HEP. GGUS:87958 in progress, updated.
            * Marian: we are also looking into PerfSONAR measurements
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * no report
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * NIKHEF: issues with renewal of VOBOX host cert, fixed (GGUS:91674)
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
      * T0:
         * ALARM ticket (GGUS:91690) for afs hosted web service which is not responding. It serves grid jobs for configuration and setup purposes
            * Stefan: it was due to an accidental !DoS by one machine
         * Many failures in CASTOR->EOS migration because of different checksums in LFC and CASTOR
            * Luca: working on it, please send the list of affected files
            * Stefan: OK, ~300 files from 2008, also on tape; the issue is due to the presence or absence of leading zeroes in the checksum
      * T1: NTR

Sites / Services round table:

   * ASGC
      * downtime Feb 25 23:00 to Feb 26 18:00 UTC for upgrades of CASTOR, DPM and storage firmware
   * BNL - ntr
   * CNAF - ntr
   * FNAL - ntr
   * IN2P3 - ntr
   * KIT - nta
   * NDGF
      * the transfer errors were due to a network problem, things look better now
      * today's SRM maintenance went OK
   * NLT1
      * this morning SARA had an unscheduled outage: dCache was unavailable due to a loose fiber
   * OSG - ntr
   * PIC - ntr
   * RAL - ntr
   * dashboards - ntr
   * grid services - ntr
   * storage
      * CASTOR DB NAS will have a HW intervention 17:30-21:30 CET, should be transparent

AOB:

---++ Friday

Attendance: local(!AndreaS, Kate, Mike, Steve, Belinda, Stefan);remote(Xavier/KIT, Gareth/RAL, Wei-Jen/ASGC, Onno/NL-T1, Michael/BNL, Matteo/CNAF, Lisa/FNAL, Rolf/IN2P3, Rob/OSG, Pepe/PIC).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * Central services
         * GGUS: A shifter could not open tickets with an issue with CRL. GGUS:91610 solved: the issue with the certificate was fixed, the shifter is able to open/update GGUS tickets now.
         * CERN/VOMS problems affecting ATLAS production and analysis jobs (GGUS:91704, GGUS:91706, GGUS:91710). Thanks to Maarten for the quick action and the ALARM (GGUS:91706).
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * CERN: VOMS incident (see below), alarm ticket GGUS:91706 opened yesterday evening ~20:00 [Steve: something similar happened months ago]
      * CERN: EOS lost 17 files, 12 were dark data
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * Ongoing activity as before: reprocessing (CERN, IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
      * T0:
         * ALARM ticket (GGUS:91690) for afs hosted web service which is not responding, understood and fixed
      * T1: NTR

Sites / Services round table:

   * ASGC: had a CASTOR crash this morning
   * BNL: ntr
   * CNAF: ntr
   * FNAL: ntr
   * !IN2P3: ntr
   * KIT: ntr
   * NL-T1:
      * one dCache pool was stuck last night for six hours and was restarted this morning. This has already happened a few times; we hope that a kernel upgrade will fix the problem.
      * This Monday and Tuesday, SURF-SARA will be in maintenance
   * PIC: ntr
   * !RAL: on Thuesday morning we declared a downtime "at risk" to reboot a network switch. The effect should be minimal.
   * OSG: ntr
   * CERN batch and grid services: !VOMS incident, wrong host certificate put in place on voms.cern.ch (see PESgroup.IncidentVOMSFeb2013). Service broken at 16:10 on Thursday, restored at 07:00 on Friday.
   * CERN storage services: ntr
   * Dashboards: ntr
   * Databases: ntr

AOB:

Topic attachments

| *Attachment* | *Size* | *Date* | *Who* | *Comment* |
| ggus-data.ppt | 2680.0 K | 2013-02-18 - 11:52 | MariaDimou | Final ALARM drills for the 2013/02/19 WLCG MB. |
Topic revision: r19 - 2013-02-22 - AndreaSciaba
Copyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.