---+!! Week of 131216

%TOC%

---++ WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

   1. Dial +41227676000 (Main) and enter access code 0119168, or
   2. To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]

The SCOD rota for the next few weeks is at ScodRota

---++ WLCG Availability, Service Incidents, Broadcasts, Operations Web

| *VO Summaries of Site Usability* ||||*SIRs* |*Broadcasts* |*Operations Web* |
| [[http://dashb-alice-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ALICE_CRITICAL&group=all%2Bsites&site%5B%5D=CCIN2P3&site%5B%5D=CERN&site%5B%5D=CNAF&site%5B%5D=FZK&site%5B%5D=NIKHEF&site%5B%5D=RAL&site%5B%5D=SARA&type=quality][ALICE]] | [[http://dashb-atlas-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ATLAS_CRITICAL&group=All%2Bsites&site%5B%5D=BNL-ATLAS&site%5B%5D=CERN-PROD&site%5B%5D=FZK-LCG2&site%5B%5D=IN2P3-CC&site%5B%5D=INFN-T1&site%5B%5D=NDGF-T1&site%5B%5D=NIKHEF-ELPROD&site%5B%5D=pic&site%5B%5D=RAL-LCG2&site%5B%5D=SARA-MATRIX&site%5B%5D=Taiwan-LCG2&site%5B%5D=TRIUMF-LCG2&type=quality][ATLAS]] | [[http://dashb-cms-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=CMS_CRITICAL_FULL&group=Tier1s%2B%252B%2BTier0&site%5B%5D=T0_CH_CERN&site%5B%5D=T1_CH_CERN&site%5B%5D=T1_DE_KIT&site%5B%5D=T1_ES_PIC&site%5B%5D=T1_FR_CCIN2P3&site%5B%5D=T1_IT_CNAF&site%5B%5D=T1_TW_ASGC&site%5B%5D=T1_UK_RAL&site%5B%5D=T1_US_FNAL&type=quality][CMS]] | [[http://dashb-lhcb-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=LHCb_CRITICAL&group=Tier%2B0/1&site%5B%5D=LCG.CERN.ch&site%5B%5D=LCG.CNAF.it&site%5B%5D=LCG.GRIDKA.de&site%5B%5D=LCG.IN2P3.fr&site%5B%5D=LCG.NIKHEF.nl&site%5B%5D=LCG.PIC.es&site%5B%5D=LCG.RAL.uk&site%5B%5D=LCG.SARA.nl&type=quality][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://operations-portal.egi.eu/broadcast/archive][Broadcast archive]] | [[WLCGOperationsWeb][Operations Web]] |

---++ General Information

| *General Information* ||| *GGUS Information* | *LHC Machine Information* |
| [[http://itssb.web.cern.ch/][CERN IT status board]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[http://cern.ch/planet-wlcg][WLCG Blogs]] | GgusInformation | [[https://espace.cern.ch/be-dep-op-lhc-machine-operation/default.aspx][Sharepoint site]] - [[http://op-webtools.web.cern.ch/op-webtools/vistar/vistars.php?usr=LHC1][LHC Page 1]] |

<HR>

---++ Monday

Attendance:
   * local: Alessandro, Belinda, Felix, Jerome, Maarten, Stefan, Steve, Xavier E
   * remote: Christian, Jose, Lisa, Michael, Onno, Pepe, Rob, Rolf, Sang-Un, Stefano, Tiju, Xavier M

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports2013][reports]] ([[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports2013?raw=on][raw view]]) -
      * Central services
         * NTR
      * T0/T1
         * IN2P3-CC SOURCE error during TRANSFER_PREPARATION phase: RQueued GGUS:99777 , solved
         * INFN-T1 Transfers failing with error Request timeout GGUS:99771 , solved
         * RAL-LCG2 Transfer failures with "source file doesn't exist" GGUS:99768 , waiting for reply
         * FZK-LCG2 issue in reading from tape, site is working on it. (FZK internal monitoring which shows no activity http://gridmon-kit.gridka.de/tapeview/atlas/index.html )
         * BNL-ATLAS is in scheduled maintenance, US Cloud offline during the first part of the intervention (which affects the network).
      * openssl issue: https://operations-portal.egi.eu/broadcast/archive/id/1066
         * Maarten summarized the events leading up to the broadcast (further details there) and added that besides CREAM also other SLC6.5 services can be affected, e.g. WMS or even storage elements, as reported below by LHCb; as it looks unlikely that !RedHat will re-enable support for 512-bit proxies in a future update, we will need to pursue fixing all "client" instances that still generate such proxies
         * Rob added that OSG experts are working on reducing the fallout on the OSG side
         * new "gridsite" versions have just been released:
            * [[http://www.eu-emi.eu/emi-2-matterhorn/updates/-/asset_publisher/9AgN/content/update-21-16-12-2013-v-2-10-5-1][EMI-2 Update 21]]
            * [[http://www.eu-emi.eu/releases/emi-3-monte-bianco/updates/-/asset_publisher/5Na8/content/update-12-16-12-2013-v-3-7-0-1][EMI-3 Update 12]]
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * Very quiet weekend. No relevant issues to report.
      * Rob: the glideinWMS factory at Indiana University ran out of disk space on Fri and has been taken out of the list temporarily, while a new SSD drive is being awaited, which probably will not arrive before Jan
   * ALICE -
      * NTR
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -
      * reprocessing of Proton-Ion collisions started last week at GRIDKA/CERN
      * At other sites main activities are simulation & user jobs
      * T0:
      * T1:
         * FZK: Pilot problems, solved (GGUS:99725)
         * FZK: Issue with tape system over the week-end, now resolved, staging throughput increasing.
      * Other:
         * Problems with FTS3 transfers to CBPF which is running slc6.5. This Linux version produces SSL3 handshake problems (GGUS:99398)
            * Steve: the FTS-3 nodes have almost finished getting reinstalled with SLC6.4 (sic), which we probably can live with for a few weeks
            * after the meeting FTS-3 project lead Michail Salichos clarified that both the FTS-3 client and the server depend on the "gridsite" provided by EPEL-stable; since the new version should get there soon and the few server instances can be kept on SL6.4 for now, standard updates can be done in Jan; the [[https://svnweb.cern.ch/trac/fts3/wiki][FTS-3 Wiki]] has been updated

Sites / Services round table:

   * ASGC - ntr
   * BNL
      * network intervention ongoing
      * new switch installed, connectivity restarted
      * new spanning tree algorithm just started, being checked
      * tomorrow dCache upgrade to v2.6 for SHA-2 support
   * CNAF - ntr
   * FNAL - ntr
   * !IN2P3 - ntr
   * KISTI
      * last week's network problems were due to a chain of events:
         * a logical volume for hypervisor storage accidentally got overwritten in a test
         * VMs then could not mount their storage
         * as the DNS was running in a VM, it became unavailable, which caused all kinds of services to fail
      * the DNS is now running on a physical node and the services have been recovered
   * KIT
      * last week the SE for ATLAS was upgraded and ran into file system problems:
         * 90 TB are still unavailable; the tech support is coming from the US
         * reading from tape was not possible, but should be OK again now
   * NDGF
      * short downtime Wed ~noon CET to reboot some pool nodes and update them to dCache 2.6
   * NLT1
      * tomorrow evening at-risk downtime for tape back-end; files only on tape will be unavailable for a while
   * OSG - nta
   * PIC
      * Thu Dec 19 downtime for cooling system maintenance plus various upgrades
   * RAL - ntr

   * grid services
      * CVMFS Stratum-0 and -1 have been migrated and upgraded OK
      * FTS-3 is being downgraded to SL6.4 because of the openssl issue (almost done)
   * storage
      * transparent EOS updates to improve http performance and e-groups support:
         * EOS-CMS ongoing
         * EOS-ATLAS tomorrow morning

AOB:

---++ Thursday

Attendance:
   * local: Alessandro, Belinda, Felix, Jerome, Maarten, Maria D, Pablo, Stefan
   * remote: Christian, Dennis, John, Lisa, Rob, Rolf, Sonia, Xavier

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports2013][reports]] ([[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports2013?raw=on][raw view]]) -
      * !OpenSSL issue: is there any official broadcast with the latest news (described by Maarten at the [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek131216#Monday][WLCG ex-daily on Monday]] )?
         * re-observed for FTS3 Pilot, Steve has been contacted.
         * Maarten explained that the matter is not fully understood at this time:
            * there has *not* been a big impact on the infrastructure so far
            * in direct job submission tests with CERN CREAM and UI instances the delegated proxies ended up with 1024-bit keys, even though nothing was updated
            * but 512-bit keys can still be reproduced at DESY-ZN
            * the SAM WMS have *not* been updated with the new =gridsite= version, hence continue generating 512-bit proxies, yet nobody reported problems due to that
            * the new =gridsite= has been tested and the update can be done at short notice, if needed
               * otherwise it will be done in Jan
            * sites are advised to keep SL6 services on SL6.4 for the next 2 weeks, unless an urgent security update requires SL6.5
         * Rob explained that OSG have done an emergency release on Tue to fix the affected Globus components
            * complications due to the use of a private interface to !OpenSSL; the code now uses a public interface instead
      * plans for the next weeks at today's [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes131219#ATLAS][WLCG Ops Coordination meeting]]
         * [[%ATTACHURL%/ADC20131210.pdf][ADC20131210.pdf]]: CVMFS inode issue
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * Very quiet days. No relevant issues to report.
   * ALICE -
      * Thanks for your contributions to another successful year and best wishes for 2014!
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -
      * reprocessing of Proton-Ion collisions in full swing at CERN/GRIDKA
      * At other sites main activities are simulation & user jobs
      * T0:
         * Impressive staging performance (140TB/24h), stage in for reprocessing is finished
      * T1:
         * GRIDKA: problems with file access via xroot, switched back to dcap for the time being.
<verbatim>
    *
   /.\
  /..'\      Many season's greetings
  /'.'\      to all sites and services.
 /.''.'\     Thanks for all your work
 /.'.'.\     and support during 2013.
/'.''.'.\
^^^[_]^^^    LHCb Grid Operations Team
</verbatim>

Sites / Services round table:

   * ASGC
      * network intervention tomorrow 07:00-10:00 UTC
   * CNAF - ntr
   * FNAL - ntr
   * !IN2P3 - ntr
   * KIT
      * the 90 TB of unavailable ATLAS data are back, checksum verifications are still ongoing
   * NDGF - ntr
   * NLT1
      * many ATLAS jobs at NIKHEF are failing due to the CVMFS inode counter overflow bug to be fixed in the next release; in the meantime the only cure is to unmount and remount the ATLAS repository, which can only be done when no process has it open
         * Alessandro: not a big impact so far; you could try a rolling intervention, viz. draining selected WN of ATLAS jobs, then fix those WN
         * see [[%ATTACHURL%/ADC20131210.pdf][CVMFS inode issue]]
   * OSG - nta
   * PIC
      * Apologies, I cannot attend today's meeting (Pepe). Today's downtime is going pretty well. We foresee to start services even before the declared end time for today's downtime.
   * RAL - ntr

   * dashboards - ntr
   * GGUS
      * Reminder: For the Year End period: GGUS is monitored by a system connected to the on-call service. In case of total GGUS unavailability the on-call engineer (OCE) at KIT will be informed and will take appropriate action. If GGUS is available but there is a problem with the workflow, e.g. ALARM to CERN doesn't generate email notification to the operators, then WLCG should submit an ALARM ticket, notifying Site DE-KIT, which triggers a phone call to the OCE. If the web portal is unavailable, contact details for KIT are recorded in the GOCDB.
   * grid services
      * lcg-bdii.cern.ch and sam-bdii.cern.ch can have different statuses for services that currently are not found in their site BDII (GGUS:99827)
         * lcg-bdii would then be wrong due to a faulty component that will be updated in Jan
   * storage - ntr

AOB:

   * [[%ATTACHURL%/ADC20131210.pdf][ADC20131210.pdf]]: CVMFS inode issue
   * *THANKS for your contributions to making 2013 a very successful year for WLCG!*
      * Further challenges and opportunities await us in 2014... :-)
   * Next meeting: Mon Jan 6

---+++ Season's Greetings!

   * [[http://cds.cern.ch/journal/CERNBulletin/2013/52/News%20Articles/1637478?ln=en][Best wishes]] for 2014!

Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| ADC20131210.pdf | r1 | 169.2 K | 2013-12-19 - 15:18 | AleDiGGi | CVMFS inode issue |
| MB-Dec.pptx | r1 | 2843.5 K | 2013-12-16 - 10:36 | PabloSaiz | |

Topic revision: r10 - 2013-12-19 - MaartenLitmaath
Copyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.