WLCG Service Incident Reports

WLCG Service Incident Report Guidelines

  • Site where the incident took place
  • Service area to which the incident related (Infrastructure, Middleware, DB, Storage or Network)
  • When the problem was detected
  • How long it lasted
  • Service to which the incident related
  • Experiment(s) impacted by the incident and, if known, which experiment activities were affected
  • Report regarding the incident and problem resolution with a detailed time line
  • What has been done, if anything, to try and make sure the problem won't reappear

  • N.B. Downtimes / degradations are always "user visible" (which is what counts...)

  • Service area is: Infrastructure, Middleware, DB, Storage or Network
  • SIRs are plotted in WLCG quarterly reports both by service area and by time to resolution (Total, > 96h, >24h)
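The counting behind such a plot can be sketched as follows. This is only an illustration, not the procedure used for the quarterly reports: the record layout, the sample values and the function name `summarize` are hypothetical, and the sketch assumes the buckets are cumulative (an incident longer than 96h is also counted under >24h).

```python
from collections import Counter

# Hypothetical sample records: (service_area, hours_to_resolution).
# These values are illustrative only and are not taken from the tables below.
SAMPLE_SIRS = [
    ("Infrastructure", 360.0),  # e.g. a 15-day outage
    ("Storage", 24.0),
    ("Network", 1.0),
    ("Middleware", 33.0),
]


def summarize(sirs):
    """Count SIRs per service area and per time-to-resolution bucket.

    Assumption: buckets are cumulative, so ">24h" also includes
    every incident counted under ">96h".
    """
    by_area = Counter(area for area, _ in sirs)
    by_resolution = {
        "Total": len(sirs),
        ">24h": sum(1 for _, hours in sirs if hours > 24),
        ">96h": sum(1 for _, hours in sirs if hours > 96),
    }
    return by_area, by_resolution


if __name__ == "__main__":
    areas, resolution = summarize(SAMPLE_SIRS)
    print(dict(areas))
    print(resolution)
```

Incidents listed without a duration ("-" or "n/a" in the tables) would need a separate convention; the sketch simply expects a number of hours for every record.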

2019

Q3 2019

Site Service Area Date Duration Service Impact Report
CNAF Infrastructure Aug 6-21, 2019 15 days all all computing resources and data unavailable SIR_CNAF_20190829.pdf

2018

Q3 2018

Site Service Area Date Duration Service Impact Report
CERN Storage Jun-Sep 2018 3 months EOS instabilities; some data loss EOS report
KIT Storage 9-10 Aug 2018 1 day dCache Lost (at least) 270k files for CMS. SIR.pdf
IN2P3-CC Storage Aug 2018 - XRootD 110 TB of ALICE data lost due to RAID problem SIR.pdf

Q1 2018

Site Service Area Date Duration Service Impact Report
CERN Databases Feb 2018 5 days LHCb Service degraded LHCb.pdf, SIR.pdf

2017

Q4 2017

Site Service Area Date Duration Service Impact Report
KIT Tape Storage Dec 2017   Tape Archive 4300 files lost total SIR.pdf

Q2 2017

Site Service Area Date Duration Service Impact Report
KIT Infrastructure 31 May 2017 6h GGUS service unavailable SIR_201705.pdf

Q1 2017

Site Service Area Date Duration Service Impact Report
CERN Database (though the problem was rather network-related) 21 Mar 2017 48h CMSR Phedex downtime in CNAF and Wisconsin 20170321_SIR_CERN_PHEDEX.pdf
KIT Storage 12 Jan 2017 - dCache/TSM 7185 ATLAS, 75 LHCb and 2 CMS files lost KIT_SIR_Lost_files_after_TSM_DB_storage_crash.pdf

2016

Q4 2016

Site Service Area Date Duration Service Impact Report
TRIUMF Storage 18 December 2016 - dCache Unrecoverable data loss TRIUMF-dcs08lun0_incident_20161218.pdf
ASGC Storage 18 Oct 2016 - DPM 135k ATLAS files (20 TB) lost due to RAID failure SIRondatalossinASGCinOct.2016.pdf
INFN-T1 Middleware 1 Oct 2016 3.5 days CREAM jobs had no valid proxy on the WN, particularly impacting LHCb post-mortem-CNAF-CE-Problem-Sept-2016.pdf

Q3 2016

Site Service Area Date Duration Service Impact Report
CERN Middleware 15 Sep 2016 33h LSF batch system, CREAM jobs could not be submitted, strongly impacting ALICE and LHCb https://twiki.cern.ch/twiki/bin/view/CMgroup/BatchServiceIncident150916

Q2 2016

Site Service Area Date Duration Service Impact Report
PIC Storage 17 December 2015 - Tape Storage a T10KD drive writing off track made several files unreadable SIR_PIC_ATLAS_T10KD_20160519.pdf
SARA Infrastructure 30 June 2016 26 hours Compute and storage Outage SURFsara_SIR_network_outage_30-6-2016.pdf

Q1 2016

Site Service Area Date Duration Service Impact Report
CERN Infrastructure 29 Mar 2 days VOMS ATLAS, CMS and LHCb affected, several experiment services affected, FTS transfers affected Report

2015

Q4 2015

Site Service Area Date Duration Service Impact Report
IN2P3-CC Network 3 Nov 1h Network The router connecting the site to the outside world broke and all external network connections stopped working SIR-IN2P3-CC-network-2015-11-03-v3.pdf
CERN Batch 5 Dec 6h Batch services loss of running jobs, degraded capacity IncidentBatchWorkerNodes

Q3 2015

Site Service Area Date Duration Service Impact Report
CERN Infrastructure 9 Jul 2h CVMFS All CMS Jobs failed on WLCG IncidentCvmfsCMS150709

Q2 2015

Site Service Area Date Duration Service Impact Report
FNAL Storage April 15 15 days dCache Unrecoverable data loss https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/uscmsT1_SIR_042015.pdf

Q1 2015

Site Service Area Date Duration Service Impact Report
SARA Storage January 15 25 days dCache Unrecoverable data loss SURFsara_Service_Incident_Report_-_bw32-1_backplane.pdf

2014

Q4 2014

Site Service Area Date Duration Service Impact Report
IN2P3-CC Network November 26 1.6 hours VOBoxes Various internal services and VOBoxes were cut off the network SIR-IN2P3-CC-network-2014-11-26-v0.pdf
CERN Storage October 11 5 hours CASTOR Outage: backend daemon of the SRM service stopped talking with the CASTOR database https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSRMCMS20141011
CERN Storage October 14 4 hours CASTOR Outage: backend daemon of the SRM service stopped talking with the CASTOR database https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSRMCMS20141016
KIT Storage September 30 - Tape Storage due to wrong tape markers loss of 424 files KIT_SIR_Storage_20141023.pdf

Q3 2014

Q2 2014

Site Service Area Date Duration Service Impact Report
KIT Network Apr 1 3 weeks Network many job and data transfer failures for all 4 experiments, due to firewall and OPN overload by ALICE jobs SIR-ALICE-KIT-overload-v2.pdf

Q1 2014

Site Service Area Date Duration Service Impact Report
RAL Infrastructure Mar 5 16h GOCDB topology and downtimes unavailable GOCDB_Outage_5th_March_2014.doc

2013

Q4 2013

Site Service Area Date Duration Service Impact Report
KIT Storage Nov 18 - tape archive 28 CMS files lost A broken tape was spotted, but 28 of its files could not be found cached on disk or at other sites anymore.
CERN Storage Nov 5 - EOS-CMS 78k files lost, 15 TB, 28 users affected https://twiki.cern.ch/twiki/pub/EOS/IncidentsEOSCMSRecursiveRm20131105/20131105-EOSCMS-Service-Report.pdf (the incident reported is not considered a service incident)
ASGC Storage November 4 - Disk Storage Lost Data (approx 1M files, 140TB of data from ATLASDATADISK) https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_DATA_LOSS_SIR-NOV_2013.pdf
KIT Storage October 28th - Disk storage, tape archival Lost data https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/KIT_SIR_Storage_20131028.pdf
NL-T1 Storage October 24th 2 months grid storage cluster Unavailability + Data loss (45 files) https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Service_Incident_Report.pdf
CERN Middleware October 7 4h VOMS Proxy creation and renewal failures, large amounts of job and data transfer failures across WLCG https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentVOMSOct2013

Q3 2013

Site Service Area Date Duration Service Impact Report
CERN Infrastructure Sep 18 8h VOBOXes, LFC, FTS, DB various central services of ATLAS, CMS and LHCb impaired, transfer failures, data access errors https://twiki.cern.ch/twiki/bin/viewauth/PESgroup/IncidentSCSSet2013
TRIUMF Storage September 16 - Disk Storage Lost Data https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Storage_incident_report_at_TRIUMF_Sep-16-2013.pdf

Q2 2013

Site Service Area Date Duration Service Impact Report
BNL Storage June 21 - Disk Storage Lost Data https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Service_Incident_Report_for_BNL_Tier1-06-2013.pdf

Q1 2013

Site Service Area Date Duration Service Impact Report
ASGC Storage Mar 27 - CASTOR lost 55 files in Atlas MCTAPE https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC-SIR20130324-Atlas_file_lost.pdf
CERN Infrastructure Feb 21-22 16h VOMS significant number of job and/or data transfer failures for all experiments throughout WLCG VOMS incident Feb 2013
CERN Batch Computing Feb 10 8h Batch Batch system was down (unavailable for users), then dispatched jobs too slowly LSF Master Daemon Crash and Slow Dispatch Issue
CERN Storage Jan 22 8h CASTOR CASTOR DB overload causing transfer slowness, mainly affecting CMS CASTOR DB loads
CERN Infrastructure Jan 19 9h all services relying on grid certificates, at CERN and elsewhere many grid services unavailable to many users, large number of jobs lost CERN CA CRL incident

2012

Q4 2012

Site Service Area Date Duration Service Impact Report
PIC Storage Dec 10 - dCache LHCb tape deleted (2 unique files lost) https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20121210SIRPICLHCblostfilesontape2.pdf
GridKa File Transfer, Storage Element Nov 27th 20 hours FTS, dCache, LFC, CondDB German cloud down for transfers (FTS users) KIT_SIR_StorageFTS_20121127.pdf
RAL all Nov 20 50h all T1 services unavailable https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20121120_UPS_Over_Voltage
RAL all Nov 7 27h all T1 services unavailable, 166 ATLAS files lost https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20121107_Site_Wide_Power_Failure
CERN Storage Oct 16 4h CASTOR CASTORCMS severely degraded due to unstable DB execution plan IncidentsCMSOverload20121016
PIC Storage Oct 9 - dCache Accidental ATLAS data deletion https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20121009_PIC_SIR_ATLAS_deleted_files.pdf

Q3 2012

Site Service Area Date Duration Service Impact Report
CNAF Storage Sep 21-27 6d StoRM LHCb data unavailable and queue closed SIR20120921.pdf
CERN Storage 7 Sep n/a EOSCMS accidental user deletion of 1PB of data report pending
ASGC Storage July 29 - Aug 07 10d CASTOR ATLAS and CMS transfer efficiency to Taiwan degraded. T0 export stopped https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120729_SIR_ASGC_STAGERDB.pdf
CERN Infrastructure ~all quarter on-going LSF slow job submission critically affecting ATLAS T0. Dispatch issues affecting ATLAS T0. https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlowResponse2012 ongoing https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlowDispatch2012
IN2P3 Infrastructure 3-4 Jul 21h CVMFS ATLAS and LHCb job failures SIR-IN2P3-CC-CVMFS-2012-07-03-v0.pdf
IN2P3 Storage 1-2 Jul 30h dCache job and transfer failures, batch on hold SIR-IN2P3-CC-dCache-2012-07-01-v1.pdf

Q2 2012

Site Service Area Date Duration Service Impact Report
IN2P3 Network 29 Jun 4 h Network All outside connectivity lost SIR-IN2P3-CC-network-2012-06-29-v0.pdf
IN2P3 Infrastructure 24 Jun 36 h CVMFS at IN2P3 ATLAS and LHCb jobs crashed, dCache overload by CMS jobs SIR-IN2P3-CC-CVMFSSquid-2012-06-24-v2.pdf
PIC WNs 21 Jun 1 h PIC Tier1 Computing About 17% of the WN capacity switched off due to cooling incident https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120621_SIR_Cooling_Incident_at_PIC.pdf
CERN Storage 18 Jun ~1h CASTOR c2atlas diskservers were not reachable for ~1h https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsRmNodeMisconfiguration20120618
CERN Storage 5 Jun 1 h CASTOR communication problems and client timeouts https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsNameServerContention20120605
PIC WNs 3-4 Jun 18 h PIC Tier1 Computing 18h of service degradation: Number of cores reduced by 60% due to cooling incident https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120603_SIR_Cooling_Incident_at_PIC.pdf
CERN DB 22 May 1.5 h CMS online DB 1.5 hours of high luminosity data lost https://twiki.cern.ch/twiki/bin/view/DB/PostMortem22May12
CERN Storage 22 May 5-40 min CASTOR ~1k unavailable files after transparent DB intervention https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsDegradationDBIntervention20120522
CERN Infrastructure 19-20 April 1 day batch batch system down https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchDown190412
CERN Infrastructure 18-20 April 2 days batch ATLAS Tier-0 job submission system could not keep up with incoming RAW data https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlow180412
ASGC Storage 11-12 April 24 h CASTOR hardware failure, DB crashed https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_SIR_2012-04-11.pdf
TRIUMF All Tier-1 services 10-11 April 20 h All Tier-1 services Two site-wide power failures https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/TRIUMF-incident-report-april10-2012.pdf
CERN Storage 4 April 1.5 h CASTOR Name Server stuck, 3 CMS files had to be rewritten https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCentralNSStuck20120404
CERN Storage 2 April many days (~10) CASTOR 1 LHCb diskserver hardware issue (files unavailable, finally 3 file systems lost) https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsDiskOnlyDataLoss20120402

Q1 2012

Site Service Area Date Duration Service Impact Report
PIC Storage 15-23 March 8 days Disk (dCache) ATLAS file loss due to RAID corruption (Adaptec 6445): 1269 files permanently lost https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/PostMortemTier-1ServiceIncidentRAIDCORRUPTIONAdaptec644515-03-2012.doc
PIC Storage 8-13 March 5 days Tape (Enstore) LTO5 tape broken, 988 files temporarily unavailable, 1 file permanently lost https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120310SIRATLASlostfileonLTO5tapeG05918.doc
CERN (and probably others) Infrastructure 20 Mar 2012 <=20hrs GGUS Some sites couldn't access GGUS web pages https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRGGUSunreachable20120320
T0+T1s DB Q1 n/a Database Various https://twiki.cern.ch/twiki/bin/view/DB/PhysicsDatabase11gUpgradeReport
PIC All Tier1 services 22 Jan 2012 5 hours All Tier1 services Outage due to site poweroff caused by cooling incident https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120122SIRPowerandCoolingProblematPIC.pdf

2011

Q4 2011

Site Service Area Date Duration Service Impact Report
CERN Compute 17/18 Dec 2011 18 hours CERN batch service Batch service downtime (unavailable for users) IncidentBatch171211
KIT Storage Dec 2011 3 Months tape archival 2 lost files https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_Service_Incident_Report_12082011.pdf
KIT Infrastructure Nov 4-7 2.5 days GGUS external interfaces No ticket updates entered other ticketing systems including SNOW at the T0 https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRSNOWinterfacefailure20111104
RAL Database (was Storage) Oct 22-23 1.5 days CASTOR DB CASTOR down https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111022_Castor_Outage_RAC_Nodes_Crashing
CERN DB Oct 11   GGUS alarms GGUS alarm to IT-DB workflow GGUSalarmToITDBworkflowPostmortemReport11102011
CERN DB Oct 11-12   ATLAS Offline (ATLR) ATLAS Offline database (ATLR) high load https://twiki.cern.ch/twiki/bin/view/DB/PostMortem12Oct11
KIT Network Oct 6 24h GGUS Ticketing systems at the T0 & some T1s couldn't get GGUS updates. https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRDNSfailure20111006

Q3 2011

Site Service Area Date Duration Service Impact Report
CERN DB Sep 27 7.30h CMS Offline CMS offline production database stuck https://twiki.cern.ch/twiki/bin/view/DB/PostMortem27Sep11
BNL DB Sep 6 4.25h Streams for conditions Discrepancy in an Oracle database table https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_BNL_CONDB.pdf
IN2P3 Infrastructure Aug 26 7.5h CE 7500 job failures https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-PowerIncident-2011-08-26-v2.pdf
IN2P3 Infrastructure Aug 15 19h CEs CEs at 100%, others at 85% degradation https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_15aug2011.pdf
CERN DB Aug 09 17h CASTOR CASTOR nameserver database overload https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsnameserverDBoverload09Aug2011
CERN DB July 29 Scheduled+2h CASTOR Upgrade-related problems with stager DB (ATLAS and CMS) https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsStagerDBUpgradeIssues04Jul2011
KIT infrastructure July 22 5d GGUS GGUS alarm emails not working 20110727GGUS_Service_Incident_Report.pdf
IN2P3 Databases July 19 3d LFC, FTS, VOMS, 3D, AMI services unavailable, some data loss https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_19july2011.pdf
KIT Storage July 12 15d ATLAS dCache 11k files lost https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-kit-atlas-dcache-20110728.pdf
CERN CASTOR July 7th 4 h CASTOR Garbage Collector taking too long https://twiki.cern.ch/twiki/bin/view/CASTORService/Incidentst1transferfull07Jul2011
CERN DB July 05 Scheduled+7h CASTOR Upgrade-related problems with stager DB (ATLAS and CMS) https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsStagerDBUpgradeIssues04Jul2011
CERN DB July 04 Scheduled+1h CASTOR Upgrade-related problems with stager DB (ATLAS and CMS) https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsStagerDBUpgradeIssues04Jul2011

Q2 2011

Site Service Area Date Duration Service Impact Report
CERN CASTOR June 26 8 h CASTOR CMS was unable to stage files back from tape https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCMSnostageout26Jun2011
PIC Computing/Storage June 10 5h dCache PNFS dCache namespace overload https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Post_Mortem_Tier-1_Service_Incident_dCache_PNFS_overload_10-June-2011_f.pdf
KIT Storage Jun 5 14 d ALICE xrootd managed storage 3% of the files unreadable https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_lost_files_alice_20110526.pdf
CERN Infrastructure May 26 6 wks KDC high KDC load https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/KDC-SIR.pdf
CERN CASTOR May 24 6 h DB overload on the CASTOR CMS instance Progressive degradation gradually affecting 80% of the service https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCMSDbOverload24May2011
CERN VObox / Lxplus / SVN,CVS / Batch May 24 3 h XLDAP overload and nscd problem Logins blocked, access to software version control blocked, batch jobs failed https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentLdapNscd24052011
PIC Computing May 25-26 12h Batch System BS instabilities, ~600 jobs lost https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Post_Mortem_PIC_Tier-1_SIR_Computing_SSC5_20110525.pdf
ASGC Infrastructure May 21 to May 23 36h Whole Site DC Power Cut https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110521_SIR_ASGC_DCPOWERCUT.pdf
CERN Batch / Lxplus / Vobox / Lxadm / Castor May 10 8 h Kerberos KDC Logins blocked, batch jobs failed, some file access blocked https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentKerberos10052011
RAL DB May 10 1h LFC Outage After Database Update >80% https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110510_LFC_Outage_After_DB_Update
ASGC Network May 01 to May 08 8 days Storage service and CMS Squid Slow transfer from/to ASGC Taiwan https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110501_SIR_ASGC_10GbLINKDOWN.pdf
CERN DB Apr 28 1.5h CMS offline DB cluster service was down for 1.5h https://twiki.cern.ch/twiki/bin/view/DB/PostMortem28Apr11
IN2P3 Infrastructure Apr 8 5h various, incl. batch system, LFC, VOBOX job failures, various services unavailable https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-PowerIncident-2011-04-08-v0.pdf

Q1 2011

Site Service Area Date Duration Service Impact Report
IN2P3 Storage Mar 19 3.5h SRM dCache SRM was unusable due to internal overload https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-in2p3-cc-dcachesrmincident-2011-03-19-v2.pdf
CERN Infrastructure Mar 19 12h Batch system Job submission became slow, then completely unresponsive https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch190302011
IN2P3 Network Mar 14 40min Batch system no connection to other French sites, but no problems observed for jobs https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-Network-2011-03-14-v1.pdf
CERN DB 11-Mar-11 5h CMS offline production db The database was completely down for ~2 hours and partially not available for 5 hours https://twiki.cern.ch/twiki/bin/view/DB/PostMortem11Mar11
IN2P3 Infrastructure Feb 25-26 13h Batch system 85% of batch system unavailable, jobs lost https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-in2p3-cc-powerincident-2011-02-25-v0.pdf
IN2P3 Storage Feb 13 3 h Storage service Storage services degraded, no big impact on jobs https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-Network-2011-02-13-v0.pdf
PIC Storage 21-Jan-11 to 08-Feb-11 18 days Storage service 250TB of ATLAS data partially unavailable https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110310_SIR_PIC_ATLAS_lost_Files.pdf
KIT infrastructure 28-Jan-11 to 02-Feb-11 5 days Batch system, job submission batch system degraded, reduced # of job slots available https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_PBS-Jan11.pdf
CERN DB 25-Jan-11 8h FTS, LFC, SAM, VOMS, dashboards affected services fail, clients may hang https://twiki.cern.ch/twiki/bin/view/DB/PostMortem25Jan11
IN2P3 infrastructure 8-Jul-10 to 7-Jan-11 6 months shared s/w area jobs fail https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-LHCb-AFS-Latency-2010-S2-v2.pdf
CNAF-BNL network 23-Aug-10 to 20-Jan-11 months primary OPN circuit poor transfer performance; ok when switched to backup in preparation

2010

Q4 2010

Site Date Duration Service Area Impact Report
CERN 18 Dec 5 days DB DB Service interruption: ATLARC DB following the power cut at CERN CC https://twiki.cern.ch/twiki/bin/view/DB/PostMortem18Dec10ATLARC
CERN 18 Dec 26 hours for services with weight > 50 power infrastructure Interruption of physics services following power cut https://twiki.cern.ch/twiki/bin/view/FIOgroup/PowerCut101218
CERN 16 Dec 2.5h DB DB ATLR database affected (degradation then complete outage) by FC switch replacement https://twiki.cern.ch/twiki/bin/view/DB/PostMortem16Dec10
CERN 7 Dec 7 days CVS infrastructure CMSSW CVS migration problems https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCMSSW101210
CERN Nov/Dec 8 days DB DB Reboots of Instance 4 of ATLR database https://twiki.cern.ch/twiki/bin/view/DB/PostMortem15Dec10
KIT 26 Nov 1.5h GGUS infrastructure No web access / no ticket update https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRGGUSfailure20101126
KIT 16 Nov 3.5h GGUS infrastructure No web access/ no ticket update https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRGGUSfailure20101116
IN2P3 11 Nov months AFS storage/infrastructure shared s/w area https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/CCIN2P3-WLCGT1SCM-LHCB-SW-Problem-Report-20101111.pdf
NL-T1 26 Oct 48h DB DB Inconsistency of data at SARA https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26Oct10
CERN 20 Oct 4.5 h Batch infrastructure Severely degraded response from CERN Batch Service https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch201010
CNAF 6 Oct 5 days CMS storage storage CMS storage down (service interruption) due to GPFS bug https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/POSTmortem-CMS-Oct2010.docx
CERN 4 Oct 2.1 h MyProxy middleware/infrastructure Outage on myproxy.cern.ch after incorrect certificate renewal https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentMyProxy041010
IN2P3 Sep 23 - Nov 22 2 months ATLAS file transfers storage Service degradation https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-Dcache-ATLAS-Transfer-Degradation-2010-Q4-v3.pdf

Q3 2010

Site Date Duration Service Impact Report
ASGC 24 Sep 5 hours to recover almost all services, except the 3D service, which took 3 weeks DB DC power cut https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100924_SIR_ASGC_DCPOWERCUT.pdf
CERN 13 Sep 1.5h CMSR DB Spontaneous reboots of nodes 2 and 4 of CMSR https://twiki.cern.ch/twiki/bin/view/DB/PostMortem14Sep10
CERN 10 Sep 4 days DB Real time downstream was not set for LFC replication https://twiki.cern.ch/twiki/bin/view/DB/PostMortem15Sep10
SARA August >3weeks DB Replication for ATLAS conditions and LHCB conditions to SARA stopped https://twiki.cern.ch/twiki/bin/view/DB/PostMortem10Sep10
ASGC 31 Aug 4 days DB CASTOR outage due to STAGER DB problem https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100831_SIR_ASGC_STAGERFTS_DB.pdf
NL-T1 August > week DB ATLAS NL-T1 cloud down, LHCb T1 site http://sirs.grid.sara.nl/docs/NL-T1_SIR-20100818.pdf
CERN 23 Aug 35 h Atlas conditions DB ATLAS data streaming to Tier1 sites stopped https://twiki.cern.ch/twiki/bin/view/DB/PostMortem23Aug10
CERN 20 Aug 4h CMS DB Instability of node 3 and 4 of CMSR https://twiki.cern.ch/twiki/bin/view/DB/PostMortem20Aug10
CERN 9 Aug 16h LHCb online LHCBONR database unavailable https://twiki.cern.ch/twiki/bin/view/DB/PostMortem09Aug10
PIC 25 Jul 30h CE Service Degradation. Cooling problem causing about 50% of WNs to be shutdown (running jobs killed) https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_CoolingModule.pdf
PIC 22 Jul 10h SE SRM service not available for ATLAS due to a problem with dCache pool costs configuration. https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_DDN.pdf
PIC 20 Jul 3h CE Computing Service not available after SD due to a wrong gridmapdir migration. https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_Gridmapdir.pdf
CERN 19 Jul 2h several Cooling failure in the vault https://twiki.cern.ch/twiki/bin/view/FIOgroup/513Temp100719
OSG/GOC 15 Jul 4h GOC GOC Service outage https://twiki.grid.iu.edu/bin/view/Operations/GOCServiceOutageJuly162010
CERN 13 Jul 1:30-9:15 CMS DB Few short interruptions of replication of CMS data from online to offline https://twiki.cern.ch/twiki/bin/view/DB/PostMortem13July10
KIT 10 Jul 4h + site Outage of central and local services due to a cooling failure https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_cooling_failure_20100710.pdf
NL-T1 5 Jul 1 week SE Reduced availability caused by data corruption http://sirs.grid.sara.nl/docs/NL-T1_SIR-20100705.pdf
NDGF 14 Jul 16h SE srm.ndgf.org downtime followed by degradation https://wiki.ndgf.org/display/ndgfwiki/20100714+dCache+server+failure
NDGF 8 Jul 3h LFC LFC downtime on lfc1.ndgf.org https://wiki.ndgf.org/display/ndgfwiki/Operation-Reports-2010.07.08
KIT 5 Jul 18h SE CMS dCache SE down because of hardware failure https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_20100706.pdf

Q2 2010

Site Date Duration Service Impact Report
RAL 30 June   SE 1083 CMS files were lost. http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100630_Disk_Server_Data_Loss_CMS
CERN 29 June 4 h CASTOR CASTOR outage due to AFS https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCastorAFS29Jun2010
CERN 29 June 5 h AFS Failure of a complete FC disk array; affected CASTOR and also LHC https://twiki.cern.ch/twiki/bin/view/AFSService/IncidentsArrayFailure29Jun2010
CERN 28 June 4+h CASTOR Log volume slowed down the Castor instances https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSrmLogFlood28Jun2010
ASGC 29 June ~ 15 hours 3D DB ASGC didn't apply stream LCRs from central 3D DB for 15 hours https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/3D-DB-incident-20100629.pdf
CERN 26 June ~50 min ATLAS offline DB (ATLR) 9 Oracle services did not fail over properly after a node eviction https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26June10
CERN 24,25 June ~10h LHCb Streaming Streaming of LHCb data to PIC was not working during 10 hours, streaming to other Tier1 sites not working for 40 minutes https://twiki.cern.ch/twiki/bin/view/DB/PostMortem24June10
CERN 22 June   CASTOR LDAP high load caused CASTOR to become unresponsive https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsLdapOverloaded22Jun2010
KIT/GridKa 12 June ~3:15h CMS dCache Service down https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_20100612.pdf
CERN 7 June ~3h CREAM CE Job submission failure https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCREAMCe070610
CERN 2 June 1 day ATLAS and LHCb online and offline databases Database access and quality of DB services compromised https://twiki.cern.ch/twiki/bin/view/DB/PostMortem02June10
CERN 1 June ~2h ATLAS offline and LCGR databases Database services unavailability during scheduled maintenance for rolling upgrade/patching https://twiki.cern.ch/twiki/bin/view/DB/PostMortem31May10
CERN 31 May ~2h CMS online Database services unavailability during scheduled maintenance for rolling upgrade/patching https://twiki.cern.ch/twiki/bin/view/DB/PostMortem31May10
CERN 26 May 10 days CMS offline database Hw failure affecting one node, cluster running at reduced capacity https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26May10
PIC 21 May 19 hours Whole site Site power cut. Outage. https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100521_SIR_PIC_PowerCut.pdf
CERN 14 May - CASTOR Data loss from incorrect recycling https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsAliceRecycled14May2010
GGUS 12 May <=4.5 hours .de domain Domain does not exist https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRdeDNSfailure20100512
CERN-ASGC 12-15 May - LHCOPN Reduced bandwidth SIRCernAsgcLinkMay2010
CNAF 28 and 29 April 9 hours & 12 hours STORM SRM blockage (hardware) followed by MCDISK full and STORM bug https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-CNAF--AtlasSRMoutage-April-2010.pdf
IN2P3 26 Apr 17.5 hours AFS Distributed File System (AFS) crashed after server overload. Batch also affected. https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-AFSoutage-2010-04-26.pdf
IN2P3 24 Apr 17 hours Batch services location service stopped responding to requests blocking most batch system commands https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-BatchOutage-2010-04-24.pdf
IN2P3 20 Apr 9 hours & 5 days Grid Downtime Notification Grid downtime notifications were impossible after two consecutive incidents https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-OperationsPortal-2010-04-22v2.pdf

Q1 2010

Site Date Duration Service Impact Report
CERN 3 Mar 18 hours DB Replication Replication of LHCb conditions Tier0->Tier1, Tier0->online partially down https://twiki.cern.ch/twiki/bin/view/PDBService/StreamsPostMortem#Replication_of_LHCb_conditions_T
IN2P3 15 Feb 4.25 hours Batch Local worker nodes lost network connectivity https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-WNs-disconnected-2010-02-15-2.pdf
PIC 10 Feb 7 hours Spanish-CA CRLs expired at CERN Complete blackout of services involving grid certificates either personal or host from Spanish CA at CERN: VOMS, FTS, etc. https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_Rediris_wLCG_formatted.pdf https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCAProxy1202210
CERN 7 Feb 4 hours Batch Tier-0 Atlas RTT cluster Degraded service on RunTimeTester cluster due to misconfiguration http://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch0702210
CERN 30 Jan 2 days CASTORATLAS The xroot daemon was looping on the castoratlas name server because of a bug and slowing down all normal name server calls which was causing the migrator policy to fail https://twiki.cern.ch/twiki/bin/viewauth/CASTORService/IncidentsMigrationBacklog01Feb2010
RAL 29 Jan 5 days CASTOR - all instances A scheduled outage to migrate the Castor Databases back to their original disk arrays encountered significant problems resulting in an extended outage. http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100129
ASGC 18 Jan 2 days power system power surge for one second and most services were restarted https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_incident_report_Jan18_2010.pdf
GridKa/KIT 13 Jan 26 hours site BDII and lcg-CE site BDII query problems and missing lcg-CE information https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_FZK-LCG2_2010-01-13.pdf
IN2P3 4 Jan 6 hours Batch Local batch system database server overload https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-lbms-DB-overload-2010-01-04.pdf

2009

Q4 2009

Site Date Duration Service Impact Report
PIC 19 Dec 4.5 hours Cooling Most of Tier-1 services shutdown to avoid increasing temperature due to cooling failure https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20091219_PIC_Service_Incident_Report.pdf
IN2P3 8 Dec 1.5 hours Networking Grid services unavailability caused by load balancing mechanism failure https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir_in2p3network_outage_10_12_2009.pdf
CERN 2 Dec 2 hours + Site wide power cut Most CC services down https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem02Dec09
RAL 30 Nov n/a Storage LHCb Data Loss Incident at RAL http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091130
CERN 20 Nov 1h SRM/ATLAS SRM high failure rate and restart after thread exhaustion https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20Nov09
CERN 18 Nov 10h CMS Dashboard Performance degradation http://dashboard.cern.ch/reports/CMSmigrationProblem
IN2P3 12 Nov n/a Storage CMS Data Loss Incident at FR-CCIN2P3 https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/2009-11-26_CMS_CCIN2P3_Report.pdf
IN2P3 3 Nov 4h Many Many services have been disturbed due to automatic reboot of machines https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_cooling_outage_03nov2009.doc
IN2P3 14 Oct 2009 13h batch only very short jobs able to run https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir_BatchIncident_15_10_09.pdf
CERN 13 Oct 2009 1-2h CASTOR nameserver sick All CASTOR services dead https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20091013
RAL 9 Oct n/a Storage (Castor) data loss from Castor http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091009
IN2P3 8 & 10 Oct 2009 11h (8 Oct) and 6h (10 Oct) SRM crashed SRM service interrupted https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_CCIN2P3_SRM_incident_08oct2009.doc
RAL 4-9 Oct 2009   disk failures -> Oracle problems CASTOR, LFC and FTS services down http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20091004
ASGC continuation - xx Nov MONTHS!!! DB & DM services See presentation at DB workshop http://indico.cern.ch/getFile.py/access?contribId=30&sessionId=4&resId=1&materialId=slides&confId=70892
ASGC 27 Sep - xx Oct >3 weeks DBs down & out https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC-DB-Sep28.pdf

Q3 2009

Site Date Duration Service Impact Report
CERN 21 Sep 2009 08:00 - 18:00 DB Replication ATLAS Replication Tier0->Tier1 down https://twiki.cern.ch/twiki/bin/view/PDBService/StreamsPostMortem
RAL 15 - 17 Sept 2009 2 days CASTOR Disk to Disk (D2D) transfers started failing during a planned upgrade to the NS http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090915
FZK 7 - 16 Sep 2009 10 days ATLAS RAC 3D Streams replication blocked then degraded https://twiki.cern.ch/twiki/bin/viewfile/LCG/WLCGServiceIncidents?rev=1;filename=SIR-FZK-20090907.pdf
CERN 5 & 8 Sept 2009 2 * 2 hours CASTOR LHCb two Castor Database problems https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090905
CERN 26 Aug 2009 18:40 - 23:30 Batch Public and production queues closed https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090826
ASGC 17 Jul 2009 6:00 - 10:00 Power cut Most services went down and restarted https://twiki.cern.ch/twiki/bin/viewfile/LCG/WLCGServiceIncidents?rev=1;filename=power_cut_ASGC.txt
ATLAS 13 Jul 2009 10:00 - 11:00 Central Catalogs Performance degradation PostMortem13Jul09

Q2 2009

Site Date Duration Service Impact Report
NL-T1 STEP09       https://twiki.cern.ch/twiki/pub/Atlas/Step09Feedback/Post_Mortem_STEP09_NL-T1-0.4.pdf
OPN 10 Jun 09 >1 day LHC OPN primary circuits to ASGC, CNAF, KIT, NDGF, TRIUMF (incl. backup) https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Fibre_Cut_June_2009.pdf
FZK STEP09 many days storage   https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_storage_FZK_GridKa.pdf
ATLAS 27 Jun 09 2 days(?) PVSS2COOL online reconstruction was stopped PostMortem27Jun09
ATLAS 24 Jun 09 8 hours PanDA and ATLR Degraded PanDA service, impact on other offline DB services on ATLR https://twiki.cern.ch/twiki/bin/view/Atlas/PandaAtlrJune2009
CERN 11 Jun 09 n/a LHCb conditions access, LFC scalability problem https://twiki.cern.ch/twiki/bin/view/PSSGroup/LFCReplicaSvcPostMortem
CERN 18 Jun 09 2 hours Batch & CASTOR services down https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090618
IN2P3 10 Jun 09 7 hours GridFTP Transfers https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_GRID-FTP_OUTAGE_2009_06_11-1.pdf
CERN 4 Jun 09 n/a CASTOR LHCb accidental garbage collection of tape0disk1 files https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090604
CERN 3 Jun 09 n/a CASTOR LHCb accidental re-enabling of garbage collection in lhcbdata https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090603
CERN 1 Jun 09 ~4 hours DB services unavailable https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsPostMortem#Network_hardware_problem_affecti
PIC 23 - 26 May 09 3 days LFC instability https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Post_mortem_LFC_indicent_23-26_May_2009_-_WikiPIC.pdf
PIC 14 May 09 5 hours cooling down SIR_PIC_COOLING_OUTAGE_2009_05_14.pdf
SARA 04 May 09 36 hours MSS down SIR_SARA_TAPEBACKEND_OUTAGE_2009_05_04.pdf
IN2P3 3 May 09 44 hours cooling down SIR_COOLING_OUTAGE_2009_05_03.pdf
IN2P3 25 Apr 09 7.5 hours MSS down SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_26-3.pdf
IN2P3 20 Apr 09 12 hours MSS down SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_22.pdf
CERN 12 Apr 09 VOMS: 2 days, SRM: 1 hour VOMS, SRM Degraded VomsPostMortem2009x04x10
PIC 10 Apr 09 8 hours SRM ATLAS, CMS and LHCb 20090411_SIR_SRM_PIC.pdf
IN2P3 02 Apr 09 24 hours tape service down https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek090330/IN2P3_02april2009_WLCG_incident_report.doc

Q1 2009

Site Date Duration Service Impact Report
RAL 29 Mar 09 33 hours complete site down http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090324
RAL 09 Mar 09 24 hours DNS all, especially SRM http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090309
CERN 04 Mar 09 3 hours CASTOR down https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090304
ASGC 25 Feb 09 days to weeks many down Fire in UPS. Partial report on Tuesday in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek090406

2008

Site Date Duration Service Impact Report
NL-T1 21 Oct 08 12 hours most down
ASGC 25 Oct 08 many days CASTOR down http://indico.cern.ch/conferenceDisplay.py?confId=44840
SARA 28 Oct 08 7 hours SE/SRM/tape b/e down https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek081103/post_mortem_tape-system_outage_25_10_in_NL.pdf
PIC 31 Oct 08 10 hours SRM down PICServiceIncidentReport20090416
NDGF 18-20 Oct 08 2 days streams - https://twiki.cern.ch/twiki/bin/view/PSSGroup/StreamsPostMortem#Problem_with_ATLAS_replication_f
CERN 24 Oct 08 3-4 hours FTS channels down or degraded https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortemFts24Oct08
CERN 24 Oct 08 2 hours VOMS short interrupt then degraded https://twiki.cern.ch/twiki/bin/view/LCG/VomsPostMortem2008x10x24
RAL 18 Oct 08 55 hours CASTOR downtime http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20081018
CERN 24/10/2008 3-4 hours FTS channels: CERN-ASGC, CERN-IN2P3, CERN-RAL, NIKHEF-CERN, PIC-CERN, SARA-CERN https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortemFts24Oct08
RAL 18/10/2008 55 hours CASTOR ATLAS http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20081018
NIKHEF 04/10/2008 ~36 hours site site https://twiki.cern.ch/twiki/pub/LCG/WLCGDailyMeetingsWeek081020/post_mortem_NL-T1_power_outage-Oct17.txt
RAL 17/09/2008 17h (LHCb) 12h (ATLAS) CASTOR 14K files http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20080917
CNAF 07/09/2008 12+ hours CASTOR complete loss of service https://twiki.cern.ch/twiki/bin/viewfile/LCG/WLCGDailyMeetingsWeek080915?rev=1;filename=post-mortem_of_September_7_CNAF_CASTOR_problem.pdf

External References

Topic attachments
I Attachment History Action Size Date Who Comment
PDFpdf 2009-11-26_CMS_CCIN2P3_Report.pdf r1 manage 78.4 K 2009-12-01 - 15:52 DirkDuellmann CMS Data Loss Incident at FR-CCIN2P3
PDFpdf 20090411_SIR_SRM_PIC.pdf r1 manage 152.9 K 2009-04-16 - 15:10 OlofBarring  
PDFpdf 20091219_PIC_Service_Incident_Report.pdf r1 manage 23.3 K 2009-12-23 - 11:20 GonzaloMerino SIR of the cooling incident at PIC on 19 Dec 2009
PDFpdf 20100521_SIR_PIC_PowerCut.pdf r1 manage 123.4 K 2010-05-31 - 09:39 GonzaloMerino SIR for the power cut affecting PIC Tier1 on 21-22 May 2010
PDFpdf 20100727_SIR_PIC_CoolingModule.pdf r1 manage 75.0 K 2010-07-27 - 18:14 GonzaloMerino Cooling problem at PIC WN module causing about 50% of WNs to be shut down (running jobs killed)
PDFpdf 20100727_SIR_PIC_DDN.pdf r1 manage 63.4 K 2010-07-27 - 17:25 GonzaloMerino SRM ATLAS problems at PIC on 22-Jul due to wrong dCache configuration. About 10h.
PDFpdf 20100727_SIR_PIC_Gridmapdir.pdf r1 manage 48.0 K 2010-07-27 - 17:23 GonzaloMerino CE failure at PIC of 3hrs on 20-Jul due to a faulty gridmapdir migration.
PDFpdf 20100831_SIR_ASGC_STAGERFTS_DB.pdf r2 r1 manage 246.7 K 2010-09-13 - 18:26 JhenWeiHuang 20100831_SIR_ASGC_STAGERFTS_DB.pdf
PDFpdf 20100924_SIR_ASGC_DCPOWERCUT.pdf r1 manage 249.2 K 2010-10-16 - 22:40 JhenWeiHuang 20100924_SIR_ASGC_DCPOWERCUT
PDFpdf 20110211_SIR_PIC_ATLAS_lost_files.pdf r1 manage 45.7 K 2011-02-11 - 13:29 GonzaloMerino Incident with ATLAS lost files at PIC 21/1/2011
PDFpdf 20110310_SIR_PIC_ATLAS_lost_Files.pdf r1 manage 74.2 K 2011-03-10 - 13:44 GonzaloMerino Update to the PIC SIR of lost files with ATLAS (21-Jan-2011 to 8-Feb-2011)
PDFpdf 20110501_SIR_ASGC_10GbLINKDOWN.pdf r1 manage 187.2 K 2011-05-11 - 19:03 JhenWeiHuang 20110501_SIR_ASGC_10GbLINKDOWN.pdf
PDFpdf 20110521_SIR_ASGC_DCPOWERCUT.pdf r1 manage 114.7 K 2011-05-27 - 07:56 JhenWeiHuang SIR for ASGC DC Power Cut on 21 May 2011
PDFpdf 20110727GGUS_Service_Incident_Report.pdf r1 manage 51.0 K 2011-07-27 - 12:00 DirkDuellmann  
PDFpdf 20120122SIRPowerandCoolingProblematPIC.pdf r1 manage 313.2 K 2012-02-03 - 14:50 GonzaloMerino SIR of the power and cooling incident at PIC Jan 22nd 2012
Microsoft Word filedoc 20120310SIRATLASlostfileonLTO5tapeG05918.doc r1 manage 34.0 K 2012-04-13 - 21:16 AlexeySedov ATLAS Tape incident at PIC
PDFpdf 20120603_SIR_Cooling_Incident_at_PIC.pdf r1 manage 54.4 K 2012-06-04 - 23:58 GonzaloMerino Cooling incident at PIC on 3-Jun-2012: Computing service degraded
PDFpdf 20120621_SIR_Cooling_Incident_at_PIC.pdf r1 manage 86.0 K 2012-06-28 - 11:57 GonzaloMerino Cooling incident at PIC 21-Jun-2012: 17% of WNs switched off
PDFpdf 20120729_SIR_ASGC_STAGERDB.pdf r1 manage 293.2 K 2012-11-21 - 18:55 JhenWeiHuang 20120729_SIR_ASGC_STAGERDB
PDFpdf 20121009_PIC_SIR_ATLAS_deleted_files.pdf r1 manage 61.3 K 2012-10-26 - 13:40 GonzaloMerino SIR for accidental ATLAS files deletion at PIC
PDFpdf 20121210SIRPICLHCblostfilesontape2.pdf r1 manage 58.7 K 2012-12-14 - 15:54 GonzaloMerino SIR for the lost LHCb tape files at PIC on Dec 2012
PDFpdf 20170321_SIR_CERN_PHEDEX.pdf r1 manage 49.0 K 2017-04-03 - 13:24 KateDziedziniewicz CMS Phedex not working at CNAF/WISCONSIN after CMSR migration
PDFpdf 3D-DB-incident-20100629.pdf r1 manage 43.4 K 2010-06-30 - 20:22 FelixLee ASGC 3D DB incident report 20100629
PDFpdf ASGC-DB-Sep28.pdf r1 manage 22.5 K 2009-10-12 - 17:05 JamieShiers  
PDFpdf ASGC-SIR20130324-Atlas_file_lost.pdf r2 r1 manage 31.4 K 2013-05-08 - 00:40 FelixLee ASGC file loss to Atlas MCTAPE
PDFpdf ASGC_DATA_LOSS_SIR-NOV_2013.pdf r1 manage 36.5 K 2013-11-21 - 14:41 FelixLee ASGC_DATA_LOSS_SIR-NOV2013
PDFpdf ASGC_SIR_2012-04-11.pdf r1 manage 281.0 K 2012-05-03 - 10:35 JhenWeiHuang ASGC_SIR_2012-04-11.pdf
PDFpdf ASGC_incident_report_Jan18_2010.pdf r2 r1 manage 16.6 K 2010-02-02 - 02:56 HorngLiangShih  
PDFpdf CCIN2P3-WLCGT1SCM-LHCB-SW-Problem-Report-20101111.pdf r1 manage 78.3 K 2011-01-18 - 10:40 JamieShiers CCIN2P3 Shared s/w area interim report
PDFpdf Fibre_Cut_June_2009.pdf r1 manage 177.9 K 2009-07-06 - 08:30 JamieShiers  
Microsoft Word filedoc GOCDB_Outage_5th_March_2014.doc r1 manage 30.5 K 2014-03-13 - 15:52 MaartenLitmaath GOCDB Outage 5th March 2014
PDFpdf GridKa_SIR_20100612.pdf r1 manage 28.5 K 2010-06-15 - 15:19 UnknownUser CMS dCache down for approx. 3h15
PDFpdf GridKa_SIR_20100706.pdf r1 manage 34.2 K 2010-07-07 - 23:44 JosVanWezel  
PDFpdf GridKa_SIR_PBS-Jan11.pdf r1 manage 47.7 K 2011-02-07 - 14:57 AndreasHeiss SIR about GridKa local batch system problems, January 2011
PDFpdf GridKa_SIR_lost_files_alice_20110526.pdf r1 manage 8.2 K 2011-06-06 - 17:23 JosVanWezel KIT SIR lost files ALICE 5/2011
PDFpdf GridKa_Service_Incident_Report_12082011.pdf r1 manage 461.9 K 2011-12-12 - 15:00 XavierMol  
PDFpdf KDC-SIR.pdf r2 r1 manage 66.1 K 2011-08-23 - 14:52 DirkDuellmann  
PDFpdf KIT_SIR_CMSChimeraDatabase_2018-08.pdf r1 manage 196.1 K 2018-08-20 - 10:15 XavierMol Database incident CMS dCache Aug 2018
PDFpdf KIT_SIR_StorageFTS_20121127.pdf r1 manage 298.0 K 2013-01-22 - 16:01 XavierMol SIR about offline FTS and dCache pool nodes end of Nov 2012 at GridKa.
PDFpdf KIT_SIR_Storage_20131028.pdf r2 r1 manage 429.2 K 2014-04-08 - 08:29 XavierMol 130 files lost for CMS
PDFpdf KIT_SIR_Storage_20141023.pdf r1 manage 203.8 K 2014-10-31 - 13:06 ThomasHartmann KIT: SIR: identification of file losses from tape due to wrong end-of-tape markers
PDFpdf KIT_SIR_TapeStorage_2017-12.pdf r1 manage 195.6 K 2018-03-13 - 08:57 XavierMol SIR KIT Tape Storage Q4 2017
PDFpdf LHCb_Databases_Upgrade_Migration_Incident_report.pdf r1 manage 43.1 K 2018-03-21 - 18:27 IgnacioCoterillo  
Unknown file formatdocx POSTmortem-CMS-Oct2010.docx r1 manage 117.8 K 2010-10-15 - 13:51 MaartenLitmaath CMS storage down at CNAF Oct 6-10, 2010
Microsoft Word filedoc PostMortemTier-1ServiceIncidentRAIDCORRUPTIONAdaptec644515-03-2012.doc r1 manage 52.5 K 2012-04-13 - 21:13 AlexeySedov ATLAS Data Loss Incident at PIC
PDFpdf Post_Mortem_PIC_Tier-1_SIR_Computing_SSC5_20110525.pdf r1 manage 94.9 K 2011-06-01 - 16:18 UnknownUser SIR for the computing incident at PIC on 25/26th May 2011
PDFpdf Post_Mortem_Tier-1_Service_Incident_dCache_PNFS_overload_10-June-2011.pdf r1 manage 129.5 K 2011-06-14 - 17:14 UnknownUser  
PDFpdf Post_Mortem_Tier-1_Service_Incident_dCache_PNFS_overload_10-June-2011_f.pdf r1 manage 129.9 K 2011-06-16 - 15:53 UnknownUser  
PDFpdf Post_mortem_LFC_indicent_23-26_May_2009_-_WikiPIC.pdf r1 manage 163.7 K 2009-05-27 - 17:28 JamieShiers  
PDFpdf SIR-2018-CCIN2P3-DiskServerFailure.pdf r1 manage 416.0 K 2018-10-05 - 16:26 EricFede SIR for CCIN2P3 Data lost on xrootd storage
PDFpdf SIR-ALICE-KIT-overload-v2.pdf r1 manage 78.8 K 2014-05-07 - 18:52 MaartenLitmaath SIR about KIT firewall and OPN overload by ALICE jobs
PDFpdf SIR-CNAF--AtlasSRMoutage-April-2010.pdf r1 manage 112.5 K 2010-05-10 - 14:22 HarryRenshall CNAF ATLAS SRM blockage 28 April then MCDISK full STORM bug
PDFpdf SIR-FZK-20090907.pdf r1 manage 74.9 K 2009-09-29 - 14:42 HarryRenshall SIR of FZK degraded ATLAS RAC 7 to 16 Sep 2009
PDFpdf SIR-IN2P3-CC-AFSoutage-2010-04-26.pdf r1 manage 12.0 K 2010-05-07 - 11:14 HarryRenshall SIR for IN2P3 AFS Outage
PDFpdf SIR-IN2P3-CC-BatchOutage-2010-04-24.pdf r1 manage 15.4 K 2010-05-04 - 09:48 HarryRenshall SIR of IN2P3 batch outage of 24/25 April 2010
PDFpdf SIR-IN2P3-CC-CVMFS-2012-07-03-v0.pdf r1 manage 6.9 K 2012-07-18 - 23:06 MaartenLitmaath IN2P3-CC CVMFS inconsistency
PDFpdf SIR-IN2P3-CC-CVMFSSquid-2012-06-24-v2.pdf r1 manage 8.7 K 2012-08-29 - 22:17 MaartenLitmaath software area unavailable at IN2P3 on 24-Jun-2012
PDFpdf SIR-IN2P3-CC-LHCb-AFS-Latency-2010-S2-v2.pdf r1 manage 212.3 K 2011-02-14 - 22:14 MaartenLitmaath Slow AFS response causing environment setup timeout for LHCb jobs
PDFpdf SIR-IN2P3-CC-Network-2011-02-13-v0.pdf r1 manage 6.8 K 2011-03-01 - 15:45 MaartenLitmaath IN2P3-CC core network switch outage due to CPU card failure
PDFpdf SIR-IN2P3-CC-Network-2011-03-14-v1.pdf r1 manage 6.2 K 2011-03-25 - 16:07 MaartenLitmaath IN2P3-CC hardware failure on network equipment
PDFpdf SIR-IN2P3-CC-OperationsPortal-2010-04-22v2.pdf r2 r1 manage 17.2 K 2010-05-07 - 11:14 HarryRenshall SIR for IN2P3 Downtimes Notification Impossible
PDFpdf SIR-IN2P3-CC-PowerIncident-2011-04-08-v0.pdf r1 manage 8.1 K 2011-04-14 - 11:29 MaartenLitmaath IN2P3-CC power incident Apr 8
PDFpdf SIR-IN2P3-CC-PowerIncident-2011-08-26-v2.pdf r1 manage 24.3 K 2011-09-14 - 20:50 MaartenLitmaath IN2P3-CC cooling system failure Aug 26
PDFpdf SIR-IN2P3-CC-WNs-disconnected-2010-02-15-2.pdf r1 manage 10.5 K 2010-02-25 - 14:28 HarryRenshall Worker node network connectivity loss at IN2P3 15 Feb 2010
PDFpdf SIR-IN2P3-CC-dCache-2012-07-01-v1.pdf r1 manage 6.7 K 2012-07-18 - 22:59 MaartenLitmaath IN2P3-CC dCache downtime due to leap second
PDFpdf SIR-IN2P3-CC-lbms-DB-overload-2010-01-04.pdf r1 manage 30.1 K 2010-01-11 - 16:08 DirkDuellmann IN2P3 Local batch system database server overload
PDFpdf SIR-IN2P3-CC-network-2012-06-29-v0.pdf r1 manage 5.7 K 2012-07-16 - 20:04 MaartenLitmaath IN2P3-CC network outage
PDFpdf SIR-IN2P3-CC-network-2014-11-26-v0.pdf r1 manage 31.6 K 2014-12-01 - 10:00 AndreaSciaba  
PDFpdf SIR-IN2P3-CC-network-2015-11-03-v3.pdf r1 manage 33.1 K 2015-11-12 - 14:18 AndreaSciaba  
PDFpdf SIR-IN2P3-Dcache-ATLAS-Transfer-Degradation-2010-Q4-v3.pdf r1 manage 281.6 K 2011-02-11 - 19:27 MaartenLitmaath IN2P3-CC dCache transfer degradation for ATLAS
PDFpdf SIR20120921.pdf r1 manage 31.9 K 2012-10-16 - 18:31 MaartenLitmaath CNAF LHCb SE 6d downtime
PDFpdf SIR_201705.pdf r1 manage 127.2 K 2017-06-06 - 12:11 MaartenLitmaath GGUS outage of 2017-05-31
PDFpdf SIR_ASGC_July_2012.pdf r1 manage 292.8 K 2012-11-21 - 18:42 JhenWeiHuang SIR_ASGC_July_2012
PDFpdf SIR_BNL_CONDB.pdf r1 manage 58.3 K 2011-09-29 - 15:12 MariaGirone  
PDFpdf SIR_BNL_DB_CFG.pdf r2 r1 manage 50.6 K 2011-09-20 - 10:01 MariaGirone  
PDFpdf SIR_CCIN2P3_15aug2011.pdf r1 manage 32.8 K 2011-08-22 - 17:12 JamieShiers  
PDFpdf SIR_CCIN2P3_19july2011.pdf r1 manage 37.0 K 2011-08-01 - 15:53 MaartenLitmaath IN2P3-CC database incidents due to disk drive failures
Microsoft Word filedoc SIR_CCIN2P3_SRM_incident_08oct2009.doc r1 manage 71.5 K 2009-10-12 - 14:22 JamieShiers  
Microsoft Word filedoc SIR_CCIN2P3_cooling_outage_03nov2009.doc r1 manage 12.5 K 2009-11-06 - 17:37 DirkDuellmann IN2P3 cooling outage Nov 3rd
PDFpdf SIR_CNAF_20190829.pdf r1 manage 49.9 K 2019-08-29 - 18:42 MaartenLitmaath CNAF site outage Aug 6-21, 2019
PDFpdf SIR_COOLING_OUTAGE_2009_05_03.pdf r1 manage 26.7 K 2009-05-22 - 14:05 HarryRenshall SIR for PIC cooling failure of 14 May 2009
PDFpdf SIR_FZK-LCG2_2010-01-13.pdf r1 manage 28.5 K 2010-01-15 - 12:58 UnknownUser SIR FZK-LCG2 (GridKa/KIT) - Information system problems on 13th and 14th of January 2010
PDFpdf SIR_GRID-FTP_OUTAGE_2009_06_11-1.pdf r1 manage 73.9 K 2009-06-16 - 11:06 JamieShiers  
PDFpdf SIR_PIC_ATLAS_T10KD_20160519.pdf r1 manage 24.3 K 2016-05-19 - 10:05 AreshVedaee T10KD issue at PIC affecting ATLAS
PDFpdf SIR_PIC_COOLING_OUTAGE_2009_04_14.pdf r1 manage 32.0 K 2009-05-22 - 14:21 HarryRenshall SIR for PIC cooling failure of 2009.05.14
PDFpdf SIR_PIC_COOLING_OUTAGE_2009_05_14.pdf r1 manage 32.0 K 2009-05-22 - 14:26 HarryRenshall SIR for PIC Cooling Outage of 14 May 2009
PDFpdf SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_22.pdf r1 manage 22.8 K 2009-04-25 - 10:06 DirkDuellmann  
PDFpdf SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_26-3.pdf r1 manage 17.6 K 2009-04-30 - 11:50 JamieShiers  
PDFpdf SIR_SARA_TAPEBACKEND_OUTAGE_2009_05_04.pdf r1 manage 22.0 K 2009-05-07 - 15:27 HarryRenshall SIR for SARA Tapebackend outage 4 to 6 May 2009
PDFpdf SIR_cooling_failure_20100710.pdf r1 manage 53.4 K 2010-07-19 - 14:28 UnknownUser SIR of the cooling incident at KIT on July 10
PDFpdf SIR_storage_FZK_GridKa.pdf r1 manage 51.7 K 2009-07-02 - 14:17 JamieShiers  
PDFpdf SIRondatalossinASGCinOct.2016.pdf r1 manage 32.1 K 2016-11-11 - 14:21 MaartenLitmaath ASGC - loss of ATLAS data, 18 Oct 2016
Unknown file formatxlsb SIRs-by-Q-2012.xlsb r1 manage 43.8 K 2012-11-23 - 14:06 JamieShiers Spreadsheet for producing SIR plots for WLCG QRs
PDFpdf SURFsara_SIR_network_outage_30-6-2016.pdf r1 manage 57.0 K 2016-07-13 - 14:36 RonTrompert1  
PDFpdf SURFsara_Service_Incident_Report_-_bw32-1_backplane.pdf r1 manage 4267.2 K 2015-02-09 - 16:58 AndreaSciaba  
PDFpdf Service_Incident_Report.pdf r1 manage 177.2 K 2014-01-14 - 12:09 SimoneCampana Service instabilities in the SURFsara grid storage cluster
PDFpdf Service_Incident_Report_for_BNL_Tier1-06-2013.pdf r1 manage 28.3 K 2013-06-26 - 21:56 MichaelErnst Service Incident Report for US ATLAS Tier-1 Center
PDFpdf Storage_incident_report_at_TRIUMF_Sep-16-2013.pdf r1 manage 46.6 K 2013-09-25 - 00:46 RedaTafirout TRIUMF incident report (lost files)
PDFpdf TRIUMF-dcs08lun0_incident_20161218.pdf r1 manage 41.7 K 2017-01-25 - 18:05 DiQing ATLAS lost files at TRIUMF due to hardware/firmware issue on December 18 2016
PDFpdf TRIUMF-incident-report-april10-2012.pdf r1 manage 29.8 K 2012-04-27 - 02:36 RedaTafirout TRIUMF incident report
PDFpdf post-mortem-CNAF-CE-Problem-Sept-2016.pdf r1 manage 141.2 K 2016-10-17 - 20:22 MaartenLitmaath  
Texttxt power_cut_ASGC.txt r1 manage 0.6 K 2009-07-31 - 16:19 GangQin power cut at ASGC on July 17th
Texttxt power_surge_ASGC_20090118.txt r1 manage 0.8 K 2010-02-01 - 12:59 GangQin Po
PDFpdf sir-in2p3-cc-dcachesrmincident-2011-03-19-v2.pdf r1 manage 7.1 K 2011-03-28 - 14:08 MaartenLitmaath IN2P3-CC dCache SRM overload
PDFpdf sir-in2p3-cc-powerincident-2011-02-25-v0.pdf r1 manage 7.8 K 2011-03-07 - 19:18 MaartenLitmaath IN2P3-CC power incident Feb 25
PDFpdf sir-kit-atlas-dcache-20110728.pdf r1 manage 25.9 K 2011-07-28 - 14:18 AndreasPetzold SIR ATLAS dCache data loss at KIT July 2011
PDFpdf sir_BatchIncident_15_10_09.pdf r1 manage 29.9 K 2009-10-15 - 16:07 JamieShiers  
PDFpdf sir_in2p3network_outage_10_12_2009.pdf r1 manage 48.8 K 2009-12-14 - 10:01 HarryRenshall SIR of IN2P3 DNS Load Balancing Failure 8 December 2009
PDFpdf uscmsT1_SIR_042015.pdf r2 r1 manage 46.7 K 2015-05-04 - 15:00 LucaMascetti 2015-05 FNAL uscms lost files
Topic revision: r287 - 2019-09-27 - JuliaAndreeva
 