WLCGDailyMeetingsWeek130415 (2013-04-18, AndreaSciaba)
---+!! Week of 130415

%TOC%

---++ Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
   1 Dial +41227676000 (Main) and enter access code 0119168, or
   1 To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]

The SCOD rota for the next few weeks is at ScodRota

---++ WLCG Availability, Service Incidents, Broadcasts, Operations Web

| *VO Summaries of Site Usability* |||| *SIRs* | *Broadcasts* | *Operations Web* |
| [[http://dashb-alice-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ALICE_CRITICAL&group=all%2Bsites&site%5B%5D=CCIN2P3&site%5B%5D=CERN&site%5B%5D=CNAF&site%5B%5D=FZK&site%5B%5D=NIKHEF&site%5B%5D=RAL&site%5B%5D=SARA&type=quality][ALICE]] | [[http://dashb-atlas-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=ATLAS_CRITICAL&group=All%2Bsites&site%5B%5D=BNL-ATLAS&site%5B%5D=CERN-PROD&site%5B%5D=FZK-LCG2&site%5B%5D=IN2P3-CC&site%5B%5D=INFN-T1&site%5B%5D=NDGF-T1&site%5B%5D=NIKHEF-ELPROD&site%5B%5D=pic&site%5B%5D=RAL-LCG2&site%5B%5D=SARA-MATRIX&site%5B%5D=Taiwan-LCG2&site%5B%5D=TRIUMF-LCG2&type=quality][ATLAS]] | [[http://dashb-cms-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=CMS_CRITICAL_FULL&group=Tier1s%2B%252B%2BTier0&site%5B%5D=T0_CH_CERN&site%5B%5D=T1_CH_CERN&site%5B%5D=T1_DE_KIT&site%5B%5D=T1_ES_PIC&site%5B%5D=T1_FR_CCIN2P3&site%5B%5D=T1_IT_CNAF&site%5B%5D=T1_TW_ASGC&site%5B%5D=T1_UK_RAL&site%5B%5D=T1_US_FNAL&type=quality][CMS]] | [[http://dashb-lhcb-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time%5B%5D=lastWeek&profile=LHCb_CRITICAL&group=Tier%2B0/1&site%5B%5D=LCG.CERN.ch&site%5B%5D=LCG.CNAF.it&site%5B%5D=LCG.GRIDKA.de&site%5B%5D=LCG.IN2P3.fr&site%5B%5D=LCG.NIKHEF.nl&site%5B%5D=LCG.PIC.es&site%5B%5D=LCG.RAL.uk&site%5B%5D=LCG.SARA.nl&type=quality][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://operations-portal.egi.eu/broadcast/archive][Broadcast archive]] | [[WLCGOperationsWeb][Operations Web]] |

---++ General Information

| *General Information* ||| *GGUS Information* | *LHC Machine Information* |
| [[http://itssb.web.cern.ch/][CERN IT status board]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[http://cern.ch/planet-wlcg][WLCG Blogs]] | GgusInformation | [[https://espace.cern.ch/be-dep-op-lhc-machine-operation/default.aspx][Sharepoint site]] - [[http://op-webtools.web.cern.ch/op-webtools/vistar/vistars.php?usr=LHC1][LHC Page 1]] |

---

---++ Monday

Attendance:
   * local: !AndreaS, Jarka, Simone, !MariaD
   * remote: Lisa (FNAL), Onno (NL-T1), Michael (BNL), Lucia (CNAF), Wei-Jen (ASGC), Xavier (KIT), Tiju (!RAL), Kyle (OSG), Rolf (!IN2P3), Rob (OSG)

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports2013][reports]] ([[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports2013?raw=on][raw view]]) -
      * T1s:
         * Some failures over the weekend copying data to FZK-LCG2 (GGUS:93320). The problem went away by itself. The ticket can be closed and another one will be opened if the problem reappears.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * No significant events in the last days.
      * LHC / CMS
         * Re-reconstruction of 2012 data is in the tails; the load at the T1 sites is small. User analysis goes on at a constant pace.
      * CERN / central services and T0
         * ntr
      * Tier-1:
         * ntr
      * Tier-2:
         * ntr
   * ALICE -
      * CNAF: VOBOX host cert expired Sat afternoon, causing CNAF to get steadily drained of ALICE jobs since 04:30 today (GGUS:93319) [Lucia adds that they already requested a new certificate but the CA is late in delivering it]
      * KIT: concurrent jobs cap lowered to 2500 on Fri afternoon to avoid overloading the firewall with off-site SE accesses while the new Xrootd servers are being debugged
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -
      * Apologies for not attending due to the LHCb Analysis and Software week
      * Still mainly user jobs
      * T0:
         * Still a problem with the SLS LFC sensor :(
      * Activities:
         * Re-stripping postponed until the beginning of May; it will start with the latest reprocessed data to take advantage of files still on disk. Then jobs will trigger staging from tape. Tier-1s should then be prepared for a high staging rate, in order to complete the re-stripping within 60 days:

| Total | TB | Rate (MB/s) |
| CERN-RDST | 259.4 | 50 |
| CNAF-RDST | 794.6 | 153 |
| GRIDKA-RDST | 643.5 | 124 |
| !IN2P3-RDST | 696.8 | 134 |
| PIC-RDST | 201.4 | 39 |
| !RAL-RDST | 575.8 | 111 |
| SARA-RDST | 537.7 | 104 |

Sites / Services round table:
   * ASGC: ntr
   * BNL: ntr
   * CNAF: ntr
   * FNAL: ntr
   * !IN2P3: ntr
   * KIT: ntr
   * NL-T1: reminder from NIKHEF about a network maintenance intervention this Thursday affecting both computing and storage services; the queues will be drained starting from Wednesday afternoon.
   * PIC: ntr
   * !RAL: ntr
   * OSG: ntr
   * Dashboards: ntr

AOB:

---++ Thursday

Attendance:
   * local: !AndreaS, Jarka, Simone, Xavier, Gavin, !MariaD
   * remote: Xavier (KIT), David (CMS), John (RAL), Lucia (CNAF), Ronald (NL-T1), Rolf (!IN2P3), Rob (OSG)

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports2013][reports]] ([[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports2013?raw=on][raw view]]) -
      * Central services
         * The SAM-ATLAS-PROD machine was down from 16 April 13:40, see https://cern.service-now.com/service-portal/view-incident.do?n=INC279368 . The importance of the machine was set to 5 (very low), while it should have been >50: this was changed yesterday (Wed) around 14:00. The problem was fixed at 10:30 on Thursday morning.
      * SARA-MATRIX: ~2800 transfer failures because the certificate expired on Tuesday. GGUS:93353 solved: the issue was fixed the same day.
      * !RAL-LCG2: still some transfer failures at a low rate with "checksum mismatch". GGUS:93315 updated on Tuesday. Under investigation.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * LHC / CMS
         * Re-reconstruction of 2012 data is in the tails; the load at the T1s continues to tail off. Analysis load continues as usual.
      * CERN / central services and T0
         * GGUS:93372 CMS VOMS -- currently correcting some manually entered incomplete DNs.
      * Tier-1:
         * GGUS:93440 KIT -- black hole worker node due to CVMFS partition being marked read only -- resolved quickly.
      * Tier-2:
         * NTR
   * ALICE -
      * central services
         * a cleanup operation unexpectedly caused a very high I/O load on the AliEn DB for a few hours starting Tue mid morning, causing lots of jobs and services around the grid to time out
         * the cleanup was interrupted mid afternoon, after which the DB needed a few more hours to roll back
         * normal conditions were restored early evening
      * KIT: the new Xrootd servers look better now - thanks!
         * concurrent jobs cap raised to 10k on Wed afternoon
      * SARA: dCache Xrootd interface fixed for writing - thanks! (GGUS:93045)
   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] ([[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -

Sites / Services round table:
   * ASGC: ntr
   * CNAF: ntr
   * !IN2P3: ntr
   * KIT: tomorrow, emergency downtime from 0900LT to 1300LT to update the GPFS cluster; 900 TB of ATLAS data will be offline
   * NL-T1: NIKHEF is in downtime for network maintenance as previously announced; everything should be back by the end of the afternoon
   * PIC: ntr
   * !RAL: next Wednesday all Oracle databases will be patched; the upgrade should be transparent, but in case of problems with the ATLAS 3D database, the ATLAS Frontier servers will fail over to use the !IN2P3 3D database. Rolf confirms that no intervention is planned at !IN2P3 at that time.
   * OSG: ntr
   * CERN batch and grid services: Next Tuesday the NFS shared storage used by myproxy.cern.ch to store proxy credentials will be moved to a new system. We expect to preserve read-only access during the intervention so that proxy certificate renewal should keep working, but it will not be possible to store or delete stored proxy certificates for a few minutes. https://itssb.web.cern.ch/planned-intervention/change-backend-storage-server-myproxycernch/23-04-2013
   * CERN storage services: last Tuesday from 1500 to 1540 there was a problem on the EOS head nodes for ATLAS and ALICE and the EOS service was unavailable
   * Dashboards: ntr
   * GGUS: GGUS Release next Wednesday, April 24 from 06:00 to 07:00 UTC with ALARM test round as usual. Reminder: As announced on https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek130304#Thursday and at the WLCG Ops. Coord. meeting last week, *the GGUS host certificate will be renewed*. This certificate is used for SOAP authentication and hence impacts all systems that consume GGUS web services. The new certificate is attached to the relevant tickets in Savannah:136227.

AOB:
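
Note on the staging rates quoted in Monday's LHCb report: the Rate (MB/s) column appears to be the average rate needed to stage each site's volume within the stated 60 days. The sketch below reproduces that arithmetic. The volumes are copied from the table; the decimal unit convention (1 TB = 10^6 MB) and the function name =staging_rate_mb_s= are assumptions made for illustration, not something stated in the report.

```python
# Volumes (TB) per tape space token, copied from the table in Monday's LHCb report.
VOLUMES_TB = {
    "CERN-RDST": 259.4,
    "CNAF-RDST": 794.6,
    "GRIDKA-RDST": 643.5,
    "IN2P3-RDST": 696.8,
    "PIC-RDST": 201.4,
    "RAL-RDST": 575.8,
    "SARA-RDST": 537.7,
}

SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # assumption: decimal units, 1 TB = 10^6 MB


def staging_rate_mb_s(volume_tb, days=60):
    """Average rate in MB/s needed to stage volume_tb within the given number of days."""
    return volume_tb * MB_PER_TB / (days * SECONDS_PER_DAY)


# Print one line per space token: volume and the average rate over 60 days.
for token, volume_tb in VOLUMES_TB.items():
    print(f"{token:12s} {volume_tb:7.1f} TB  {staging_rate_mb_s(volume_tb):6.1f} MB/s")
```

Rounded to the nearest integer, these values agree with the rates in the table, which suggests the quoted figures are sustained averages over the full 60 days rather than peak rates.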
Topic revision: r9 - 2013-04-18 - AndreaSciaba
Copyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.