---+!! Week of 200622

%TOC%

---++ WLCG Operations Call details

   * At CERN the meeting room is [[https://maps.cern.ch/mapsearch/mapsearch.htm?n=%5b%27513/R-068%27%5d][513-R-068]].
   * For remote participation we use the Vidyo system. Instructions can be found [[https://indico.cern.ch/conferenceDisplay.py?confId=287280][here]].

---++ General Information

   * The purpose of the meeting is:
      * to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
      * to announce or schedule interventions at Tier-1 sites;
      * to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
      * to provide important news about the middleware;
      * to communicate any other information considered interesting for WLCG operations.
   * The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
   * The SCOD rota for the next few weeks is at ScodRota
   * Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cernSPAMNOT.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or invite the right people at the meeting.

---++ Best practices for scheduled downtimes

   * [[BestPracticesForSchedDT][Best practices for scheduled downtimes]]

---++ Monday

Attendance:
   * local:
   * remote:

Experiments round table:

   * ATLAS [[AtlasComputing.ADCOperationsWeeklySummaries2020][reports]] ( [[AtlasComputing.ADCOperationsWeeklySummaries2020?raw=on][raw view]]) -
      * Storage issues at: UKI-LT2-QMUL [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=147558][GGUS:147558]], UNIBE-LHEP [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=147556][GGUS:147556]], UKI-NORTHGRID-LANCS [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=147553][GGUS:147553]], INFN-T1 [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=147462][GGUS:147462]]
      * CERN Frontiers were down during part of the week
   * CMS [[CMS.FacOps_WLCGdailyreports][reports]] ( [[CMS.FacOps_WLCGdailyreports?raw=on][raw view]]) -
      * *It's (virtual) CMS Computing Workshop and likely nobody from CMS can call in*
      * Bad CMS workflow caused storage overload at CC-IN2P3
      * Clarified within CMS and bad WF got aborted - no GGUS ticket
      * Otherwise no major items to report
   * ALICE -
      * NTR
   * LHCb [[LHCb.ProductionOperationsWLCGdailyReports][reports]] ( [[LHCb.ProductionOperationsWLCGdailyReports?raw=on][raw view]]) -
      * completing stripping of heavy-ion collision data
      * MC production as usual
      * planning for DB outage of Saturday
      * GGUS down -- some impact on operations

Sites / Services round table:

   * ASGC:
   * BNL: FTS was upgraded last Monday, to version 3.9.4.
   * CNAF: NTR
   * EGI:
   * FNAL: Two incidents with tape cartridges in entirely different libraries last week. We had an RFID chip in a ~4-year-old T10kC cartridge fail. We believe this is recoverable, but the data is unavailable until Oracle re-finds the file boundaries and recovers. On Friday, though, we had a more catastrophic failure of a few-month-old LTO8 tape in our IBM library. The tape itself snapped and wound within the drive. We removed the entangled cartridge and drive and shipped them to IBM for further investigation.
     In that incident we expect about 192 CMS files to be lost.
   * !IN2P3: during maintenance last Tuesday:
      * !CREAM-CEs have been decommissioned.
      * upgrade of !HTCondorCE to 4.1.0: incident on June 19th with HTCondor job router -> impact on ALICE during the afternoon.
      * dCache: upgrade to 5.2.22 and rollback to version 5.2.21 (last pools for ATLAS/CMS analysis done today midday)
      * new endpoints for LHCb on dCache to distinguish disk and tape. Registered in GOC DB.
   * JINR: NTR
   * KISTI:
   * KIT:
   * NDGF:
   * NL-T1:
      * The slow directory listings at the Sara dCache (reported last week) are understood. A user has 3 million files in one directory, which is a bit of a challenge. We fixed this by increasing pnfsmanager.limits.list-threads from 2 (default) to 10. This slightly increases the database load but nothing it can't handle.
      * The Sara tape backend was down from Saturday until this afternoon. It is operational now.
   * NRC-KI:
   * OSG:
   * PIC: PIC Tier 1 will be in scheduled downtime on Tuesday June 30th, from 08:00 to 14:00 (CERN and local time), in order to perform upgrades on the compute (HTCondor) and storage (dCache and Enstore) services. As usual, access to the CPU farm will be closed right before the start of the SD.
   * !RAL: NTR
   * TRIUMF:
   * CERN computing services:
      * DB intervention affecting some Compute services
         * Hammercloud [[https://cern.service-now.com/service-portal?id=outage&n=OTG0057297][OTG0057297]]
         * !SLURM [[https://cern.service-now.com/service-portal?id=outage&n=OTG0057335][OTG0057335]]
         * !VOMS
         * !BOINC [[https://cern.service-now.com/service-portal?id=outage&n=OTG0057334][OTG0057334]]
   * CERN storage services:
      * On 23rd of June, final closure of the ATLAS area in CASTOR: access to the ATLAS tree of CASTOR will be definitively blocked, before final migration of data to CTA.
        [[https://cern.service-now.com/service-portal?id=outage&n=OTG0057293][OTG0057293]]
      * On 25th of June, the EOSCTA ATLAS instance will be opened for reads and writes [[https://cern.service-now.com/service-portal?id=outage&n=OTG0057317][OTG0057317]]
      * On 27th of June, all CASTOR and CTA instances will be down due to a database intervention [[https://cern.service-now.com/service-portal?id=outage&n=OTG0057292][OTG0057292]]
      * On 27th of June, many FTS instances will be unavailable due to DB storage intervention [[https://cern.service-now.com/service-portal?id=outage&n=OTG0057307][OTG0057307]]
   * CERN databases:
      * Major intervention on Saturday 27th affecting Oracle and DBoD databases. More details:
         * OTG:0057263 (Database storage)
         * OTG:0057252 (DBoD)
         * OTG:0057201 (Oracle)
   * GGUS:
      * Access via CERN Grid CA certificates was refused from Sun afternoon till Mon morning.
         * Due to an expired CRL.
         * The problem was announced on the =wlcg-operations= list Sun afternoon.
         * The GOCDB entry for GGUS provides the contact e-mail to be used for such cases.
   * Monitoring: We will also be sharing the draft reports with site managers during the week, after the green light from experiment representatives
   * MW Officer:
   * Networks:
   * Security: NTR

AOB:
Topic revision: r22 - 2020-06-22 - DavidMason
Copyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.