Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN-PROD | Middleware | Oct 31 | 1 day | IAM-ATLAS | HTCondor CE job submission timeouts worldwide | WLCG_AuthZ_Meeting_-_ATLAS_IAM_Outage.pdf |
RAL-LCG2 | Network | Oct 17-28 | 11 days | all | outages and degradation |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
FZK-LCG2 | Storage | Mar 18, 2022 | 3d | dCache and xrootd SE | Abrupt outage of all grid SEs due to network intervention | KIT_SIR_OnlineStorage_2022-03.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN-PROD | Middleware | Aug 20-21, 2021 | 6h | Rucio auth service | service unreachable from outside CERN, all ATLAS distributed computing activities stuck | RucioAuthSvcInc20210820 |
CERN-PROD | Middleware | July 9 and 25, 2021 | several days | FTS-ATLAS | service dysfunctional, traffic redirected to BNL FTS | FTS service incident report |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN-PROD | Middleware | June 24-25, 2020 | 24h | CERN Grid CA OCSP service | CERN Grid CA certificates could not be used for job submission to CREAM CE instances at tens of sites, affecting the 4 experiments and, through the SAM tests, those sites | OTG:0057432 CERN_OCSP_incident_report.pdf |
CERN-PROD | Databases | May 27, 2020 | 1 day + 5 days for ATLAS replication to T1 sites | many | Many site and experiment services affected | DB post-mortem 27 May 2020 |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN-PROD | Infrastructure, Storage, Middleware | Feb 20, 2020 | 9 hours | many | Many site services affected, all grid computing resources unavailable | CERNProdIncident200220 |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CNAF | Infrastructure | Aug 6-21, 2019 | 15 days | all | all computing resources and data unavailable | SIR_CNAF_20190829.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN | Storage | Jun-Sep 2018 | 3 months | EOS | instabilities; some data loss | EOS report |
KIT | Storage | 9-10 Aug 2018 | 1 day | dCache | Lost (at least) 270k files for CMS. | SIR.pdf |
IN2P3-CC | Storage | Aug 2018 | - | XRootD | 110 TB of ALICE data lost due to RAID problem | SIR.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN | Databases | Feb 2018 | 5 days | LHCb | Service degraded | LHCb.pdf (SIR.pdf) |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
KIT | Tape Storage | Dec 2017 | | Tape Archive | 4300 files lost in total | SIR.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
KIT | Infrastructure | 31 May 2017 | 6h | GGUS | service unavailable | SIR_201705.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN | Database (though the problem was rather network-related) | 21 Mar 2017 | 48h | CMSR | Phedex downtime in CNAF and Wisconsin | 20170321_SIR_CERN_PHEDEX.pdf |
KIT | Storage | 12 Jan 2017 | - | dCache/TSM | 7185 ATLAS, 75 LHCb and 2 CMS files lost | KIT_SIR_Lost_files_after_TSM_DB_storage_crash.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
TRIUMF | Storage | 18 December 2016 | - | dCache | Unrecoverable data loss | TRIUMF-dcs08lun0_incident_20161218.pdf |
ASGC | Storage | 18 Oct 2016 | - | DPM | 135k ATLAS files (20 TB) lost due to RAID failure | SIRondatalossinASGCinOct.2016.pdf |
INFN-T1 | Middleware | 1 Oct 2016 | 3.5 days | CREAM | jobs had no valid proxy on the WN, particularly impacting LHCb | post-mortem-CNAF-CE-Problem-Sept-2016.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN | Middleware | 15 Sep 2016 | 33h | LSF batch system, CREAM | jobs could not be submitted, strongly impacting ALICE and LHCb | https://twiki.cern.ch/twiki/bin/view/CMgroup/BatchServiceIncident150916 |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
PIC | Storage | 17 December 2015 | - | Tape Storage | a T10KD drive writing off track made several files unreadable | SIR_PIC_ATLAS_T10KD_20160519.pdf |
SARA | Infrastructure | 30 June 2016 | 26 hours | Compute and storage | Outage | SURFsara_SIR_network_outage_30-6-2016.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN | Infrastructure | 29 Mar | 2 days | VOMS | ATLAS, CMS and LHCb affected, several experiment services affected, FTS transfers affected | Report |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
IN2P3-CC | Network | 3 Nov | 1h | Network | The router connecting the site to the outside world broke and all external network connections stopped working | SIR-IN2P3-CC-network-2015-11-03-v3.pdf |
CERN | Batch | 5 Dec | 6h | Batch services | loss of running jobs, degraded capacity | IncidentBatchWorkerNodes |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN | Infrastructure | 9 Jul | 2h | CVMFS | All CMS Jobs failed on WLCG | IncidentCvmfsCMS150709 |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
FNAL | Storage | April 15 | 15 days | dCache | Unrecoverable data loss | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/uscmsT1_SIR_042015.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
SARA | Storage | January 15 | 25 days | dCache | Unrecoverable data loss | SURFsara_Service_Incident_Report_-_bw32-1_backplane.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
IN2P3-CC | Network | November 26 | 1.6 hours | VOBoxes | Various internal services and VOBoxes were cut off the network | SIR-IN2P3-CC-network-2014-11-26-v0.pdf |
CERN | Storage | October 11 | 5 hours | CASTOR | Outage: backend daemon of the SRM service stopped talking with the CASTOR database | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSRMCMS20141011 |
CERN | Storage | October 14 | 4 hours | CASTOR | Outage: backend daemon of the SRM service stopped talking with the CASTOR database | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSRMCMS20141016 |
KIT | Storage | September 30 | - | Tape Storage | due to wrong tape markers loss of 424 files | KIT_SIR_Storage_20141023.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
KIT | Network | Apr 1 | 3 weeks | Network | many job and data transfer failures for all 4 experiments, due to firewall and OPN overload by ALICE jobs | SIR-ALICE-KIT-overload-v2.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
RAL | Infrastructure | Mar 5 | 16h | GOCDB | topology and downtimes unavailable | GOCDB_Outage_5th_March_2014.doc |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
KIT | Storage | Nov 18 | - | tape archive | 28 CMS files lost | A broken tape was spotted, but 28 of its files could not be found cached on disk or at other sites anymore. |
CERN | Storage | Nov 5 | - | EOS-CMS | 78k files lost, 15 TB, 28 users affected | https://twiki.cern.ch/twiki/pub/EOS/IncidentsEOSCMSRecursiveRm20131105/20131105-EOSCMS-Service-Report.pdf (the incident reported is not considered a service incident) |
ASGC | Storage | November 4 | - | Disk Storage | Lost Data (approx 1M files, 140TB of data from ATLASDATADISK) | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_DATA_LOSS_SIR-NOV_2013.pdf |
KIT | Storage | October 28th | - | Disk storage, tape archival | Lost data | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/KIT_SIR_Storage_20131028.pdf |
NL-T1 | Storage | October 24th | 2 months | grid storage cluster | Unavailability + Data loss (45 files) | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Service_Incident_Report.pdf |
CERN | Middleware | October 7 | 4h | VOMS | Proxy creation and renewal failures, large amounts of job and data transfer failures across WLCG | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentVOMSOct2013 |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN | Infrastructure | Sep 18 | 8h | VOBOXes, LFC, FTS, DB | various central services of ATLAS, CMS and LHCb impaired, transfer failures, data access errors | https://twiki.cern.ch/twiki/bin/viewauth/PESgroup/IncidentSCSSet2013 |
TRIUMF | Storage | September 16 | - | Disk Storage | Lost Data | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Storage_incident_report_at_TRIUMF_Sep-16-2013.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
BNL | Storage | June 21 | - | Disk Storage | Lost Data | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/Service_Incident_Report_for_BNL_Tier1-06-2013.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
ASGC | Storage | Mar 27 | - | CASTOR | lost 55 files in Atlas MCTAPE | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC-SIR20130324-Atlas_file_lost.pdf |
CERN | Infrastructure | Feb 21-22 | 16h | VOMS | significant number of job and/or data transfer failures for all experiments throughout WLCG | VOMS incident Feb 2013 |
CERN | Batch Computing | Feb 10 | 8h | Batch | Batch system was down (unavailable for users), then dispatched jobs too slowly | LSF Master Daemon Crash and Slow Dispatch Issue |
CERN | Storage | Jan 22 | 8h | CASTOR | CASTOR DB overload causing transfer slowness, mainly affecting CMS | CASTOR DB loads |
CERN | Infrastructure | Jan 19 | 9h | all services relying on grid certificates, at CERN and elsewhere | many grid services unavailable to many users, large number of jobs lost | CERN CA CRL incident |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
PIC | Storage | Dec 10 | - | dCache | LHCb tape deleted (2 unique files lost) | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20121210SIRPICLHCblostfilesontape2.pdf |
GridKa | File Transfer, Storage Element | Nov 27th | 20 hours | FTS, dCache, LFC, CondDB | German cloud down for transfers (FTS users) | KIT_SIR_StorageFTS_20121127.pdf |
RAL | all | Nov 20 | 50h | all | T1 services unavailable | https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20121120_UPS_Over_Voltage |
RAL | all | Nov 7 | 27h | all | T1 services unavailable, 166 ATLAS files lost | https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20121107_Site_Wide_Power_Failure |
CERN | Storage | Oct 16 | 4h | CASTOR | CASTORCMS severely degraded due to unstable DB execution plan | IncidentsCMSOverload20121016 |
PIC | Storage | Oct 9 | - | dCache | Accidental ATLAS data deletion | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20121009_PIC_SIR_ATLAS_deleted_files.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CNAF | Storage | Sep 21-27 | 6d | StoRM | LHCb data unavailable and queue closed | SIR20120921.pdf |
CERN | Storage | 7 Sep | n/a | EOSCMS | accidental user deletion of 1PB of data | report pending |
ASGC | Storage | July 29 - Aug 07 | 10d | CASTOR | ATLAS and CMS transfer efficiency to Taiwan degraded. T0 export stopped | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120729_SIR_ASGC_STAGERDB.pdf |
CERN | Infrastructure | ~all quarter | on-going | LSF | slow job submission critically affecting ATLAS T0. Dispatch issues affecting ATLAS T0. | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlowResponse2012 ongoing https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlowDispatch2012 |
IN2P3 | Infrastructure | 3-4 Jul | 21h | CVMFS | ATLAS and LHCb job failures | SIR-IN2P3-CC-CVMFS-2012-07-03-v0.pdf |
IN2P3 | Storage | 1-2 Jul | 30h | dCache | job and transfer failures, batch on hold | SIR-IN2P3-CC-dCache-2012-07-01-v1.pdf |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
IN2P3 | Network | 29 Jun | 4 h | Network | All outside connectivity lost | SIR-IN2P3-CC-network-2012-06-29-v0.pdf |
IN2P3 | Infrastructure | 24 Jun | 36 h | CVMFS at IN2P3 | ATLAS and LHCb jobs crashed, dCache overload by CMS jobs | SIR-IN2P3-CC-CVMFSSquid-2012-06-24-v2.pdf |
PIC | WNs | 21 Jun | 1 h | PIC Tier1 Computing | About 17% of the WN capacity switched off due to cooling incident | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120621_SIR_Cooling_Incident_at_PIC.pdf |
CERN | Storage | 18 Jun | ~1h | CASTOR | c2atlas diskservers were not reachable for ~1h | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsRmNodeMisconfiguration20120618 |
CERN | Storage | 5 Jun | 1 h | CASTOR | communication problems and client timeouts | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsNameServerContention20120605 |
PIC | WNs | 3-4 Jun | 18 h | PIC Tier1 Computing | 18h of service degradation: Number of cores reduced by 60% due to cooling incident | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120603_SIR_Cooling_Incident_at_PIC.pdf |
CERN | DB | 22 May | 1.5 h | CMS online DB | 1.5 hours of high luminosity data lost | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem22May12 |
CERN | Storage | 22 May | 5-40 min | CASTOR | ~1k unavailable files after transparent DB intervention | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsDegradationDBIntervention20120522 |
CERN | Infrastructure | 19-20 April | 1 day | batch | batch system down | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchDown190412 |
CERN | Infrastructure | 18-20 April | 2 days | batch | ATLAS Tier-0 job submission system could not keep up with incoming RAW data | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlow180412 |
ASGC | Storage | 11-12 April | 24 h | CASTOR | hardware failure, DB crashed | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_SIR_2012-04-11.pdf |
TRIUMF | All Tier-1 services | 10-11 April | 20 h | All Tier-1 services | Two site-wide power failures | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/TRIUMF-incident-report-april10-2012.pdf |
CERN | Storage | 4 April | 1.5 h | CASTOR | Name Server stuck, 3 CMS files had to be rewritten | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCentralNSStuck20120404 |
CERN | Storage | 2 April | many days (~10) | CASTOR | 1 LHCb diskserver hardware issue (files unavailable, finally 3 file systems lost) | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsDiskOnlyDataLoss20120402 |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
CERN | Compute | 17/18 Dec 2011 | 18 hours | CERN batch service | Batch service downtime (unavailable for users) | IncidentBatch171211 |
KIT | Storage | Dec 2011 | 3 months | tape archival | 2 lost files | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_Service_Incident_Report_12082011.pdf |
KIT | Infrastructure | Nov 4-7 | 2.5 days | GGUS external interfaces | No ticket updates entered other ticketing systems including SNOW at the T0 | https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRSNOWinterfacefailure20111104 |
RAL | Database (was Storage) | Oct 22-23 | 1.5 days | CASTOR DB | CASTOR down | https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20111022_Castor_Outage_RAC_Nodes_Crashing |
CERN | DB | Oct 11 | | GGUS alarms | GGUS alarm to IT-DB workflow | GGUSalarmToITDBworkflowPostmortemReport11102011 |
CERN | DB | Oct 11-12 | | ATLAS Offline (ATLR) | ATLAS Offline database (ATLR) high load | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem12Oct11 |
KIT | Network | Oct 6 | 24h | GGUS | Ticketing systems at the T0 & some T1s couldn't get GGUS updates. | https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRDNSfailure20111006 |
Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
IN2P3 | Storage | Mar 19 | 3.5h | SRM | dCache SRM was unusable due to internal overload | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-in2p3-cc-dcachesrmincident-2011-03-19-v2.pdf |
CERN | Infrastructure | Mar 19 | 12h | Batch system | Job submission became slow, then completely unresponsive | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch190302011 |
IN2P3 | Network | Mar 14 | 40min | Batch system | no connection to other French sites, but no problems observed for jobs | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-Network-2011-03-14-v1.pdf |
CERN | DB | 11-Mar-11 | 5h | CMS offline production db | The database was completely down for ~2 hours and partially not available for 5 hours | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem11Mar11 |
IN2P3 | Infrastructure | Feb 25-26 | 13h | Batch system | 85% of batch system unavailable, jobs lost | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/sir-in2p3-cc-powerincident-2011-02-25-v0.pdf |
IN2P3 | Storage | Feb 13 | 3 h | Storage service | Storage services degraded, no big impact on jobs | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-Network-2011-02-13-v0.pdf |
PIC | Storage | 21-Jan-11 to 08-Feb-11 | 18 days | Storage service | 250TB of ATLAS data partially unavailable | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20110310_SIR_PIC_ATLAS_lost_Files.pdf |
KIT | infrastructure | 28-Jan-11 to 02-Feb-11 | 5 days | Batch system, job submission | batch system degraded, reduced # of job slots available | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_PBS-Jan11.pdf |
CERN | DB | 25-Jan-11 | 8h | FTS, LFC, SAM, VOMS, dashboards | affected services fail, clients may hang | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem25Jan11 |
IN2P3 | infrastructure | 8-Jul-10 to 7-Jan-11 | 6 months | shared s/w area | jobs fail | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-LHCb-AFS-Latency-2010-S2-v2.pdf |
CNAF-BNL | network | 23-Aug-10 to 20-Jan-11 | months | primary OPN circuit | poor transfer performance; ok when switched to backup | in preparation |
Site | Date | Duration | Service | Area | Impact | Report |
---|---|---|---|---|---|---|
CERN | 18 Dec | 5 days | DB | DB | Service interruption: ATLARC DB following the power cut at CERN CC | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem18Dec10ATLARC |
CERN | 18 Dec | 26 hours for services with weight > 50 | power | infrastructure | Interruption of physics services following power cut | https://twiki.cern.ch/twiki/bin/view/FIOgroup/PowerCut101218 |
CERN | 16 Dec | 2.5h | DB | DB | ATLR database affected (degradation then complete outage) by FC switch replacement | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem16Dec10 |
CERN | 7 Dec | 7 days | CVS | infrastructure | CMSSW CVS migration problems | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCMSSW101210 |
CERN | Nov/Dec | 8 days | DB | DB | Reboots of Instance 4 of ATLR database | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem15Dec10 |
KIT | 26 Nov | 1.5h | GGUS | infrastructure | No web access / no ticket update | https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRGGUSfailure20101126 |
KIT | 16 Nov | 3.5h | GGUS | infrastructure | No web access/ no ticket update | https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRGGUSfailure20101116 |
IN2P3 | 11 Nov | months | AFS | storage/infrastructure | shared s/w area | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/CCIN2P3-WLCGT1SCM-LHCB-SW-Problem-Report-20101111.pdf |
NL-T1 | 26 Oct | 48h | DB | DB | Inconsistency of data at SARA | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26Oct10 |
CERN | 20 Oct | 4.5 h | Batch | infrastructure | Severely degraded response from CERN Batch Service | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch201010 |
CNAF | 6 Oct | 5 days | CMS storage | storage | CMS storage down (service interruption) due to GPFS bug | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/POSTmortem-CMS-Oct2010.docx |
CERN | 4 Oct | 2.1 h | MyProxy | middleware/infrastructure | Outage on myproxy.cern.ch after incorrect certificate renewal | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentMyProxy041010 |
IN2P3 | Sep 23 - Nov 22 | 2 months | ATLAS file transfers | storage | Service degradation | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-Dcache-ATLAS-Transfer-Degradation-2010-Q4-v3.pdf |
Site | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|
ASGC | 24 Sep | 5 hours to recover almost all services, except the 3D service, which took 3 weeks | DB | DC power cut | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100924_SIR_ASGC_DCPOWERCUT.pdf |
CERN | 13 Sep | 1.5h | CMSR DB | Spontaneous reboots of nodes 2 and 4 of CMSR | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem14Sep10 |
CERN | 10 Sep | 4 days | DB | Real time downstream was not set for LFC replication | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem15Sep10 |
SARA | August | >3weeks | DB | Replication for ATLAS conditions and LHCB conditions to SARA stopped | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem10Sep10 |
ASGC | 31 Aug | 4 days | DB | CASTOR outage due to STAGER DB problem | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100831_SIR_ASGC_STAGERFTS_DB.pdf |
NL-T1 | August | > week | DB | ATLAS NL-T1 cloud down, LHCb T1 site | http://sirs.grid.sara.nl/docs/NL-T1_SIR-20100818.pdf |
CERN | 23 Aug | 35 h | Atlas conditions DB | ATLAS data streaming to Tier1 sites stopped | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem23Aug10 |
CERN | 20 Aug | 4h | CMS DB | Instability of node 3 and 4 of CMSR | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem20Aug10 |
CERN | 9 Aug | 16h | LHCb online | LHCBONR database unavailable | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem09Aug10 |
PIC | 25 Jul | 30h | CE | Service degradation. Cooling problem causing about 50% of WNs to be shut down (running jobs killed) | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_CoolingModule.pdf |
PIC | 22 Jul | 10h | SE | SRM service not available for ATLAS due to a problem with dCache pool costs configuration. | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_DDN.pdf |
PIC | 20 Jul | 3h | CE | Computing Service not available after SD due to a wrong gridmapdir migration. | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100727_SIR_PIC_Gridmapdir.pdf |
CERN | 19 Jul | 2h | several | Cooling failure in the vault | https://twiki.cern.ch/twiki/bin/view/FIOgroup/513Temp100719 |
OSG/GOC | 15 Jul | 4h | GOC | GOC Service outage | https://twiki.grid.iu.edu/bin/view/Operations/GOCServiceOutageJuly162010 |
CERN | 13 Jul | 1:30-9:15 | CMS DB | Few short interruptions of replication of CMS data from online to offline | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem13July10 |
KIT | 10 Jul | 4h + | site | Outage of central and local services due to a cooling failure | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_cooling_failure_20100710.pdf |
NL-T1 | 5 Jul | 1 week | SE | Reduced availability caused by data corruption | http://sirs.grid.sara.nl/docs/NL-T1_SIR-20100705.pdf |
NDGF | 14 Jul | 16h | SE | srm.ndgf.org downtime followed by degradation | https://wiki.ndgf.org/display/ndgfwiki/20100714+dCache+server+failure |
NDGF | 8 Jul | 3h | LFC | LFC downtime on lfc1.ndgf.org | https://wiki.ndgf.org/display/ndgfwiki/Operation-Reports-2010.07.08 |
KIT | 5 Jul | 18h | SE | CMS dCache SE down because of hardware failure | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_20100706.pdf |
Site | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|
RAL | 30 June | | SE | 1083 CMS files were lost. | http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100630_Disk_Server_Data_Loss_CMS |
CERN | 29 June | 4 h | CASTOR | CASTOR outage due to AFS | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCastorAFS29Jun2010 |
CERN | 29 June | 5 h | AFS | failure of a complete FC disk array, which affected CASTOR and also LHC | https://twiki.cern.ch/twiki/bin/view/AFSService/IncidentsArrayFailure29Jun2010 |
CERN | 28 June | 4+h | CASTOR | Log volume slowed down the Castor instances | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsSrmLogFlood28Jun2010 |
ASGC | 29 June | ~ 15 hours | 3D DB | ASGC didn't apply stream LCRs from central 3D DB for 15 hours | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/3D-DB-incident-20100629.pdf |
CERN | 26 June | ~50 min | ATLAS offline DB (ATLR) | 9 Oracle services did not fail over properly after a node eviction | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26June10 |
CERN | 24,25 June | ~10h | LHCb Streaming | Streaming of LHCb data to PIC was not working during 10 hours, streaming to other Tier1 sites not working for 40 minutes | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem24June10 |
CERN | 22 June | | CASTOR | LDAP high load caused CASTOR to become unresponsive | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsLdapOverloaded22Jun2010 |
KIT/GridKa | 12 June | ~3:15h | CMS dCache | Service down | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/GridKa_SIR_20100612.pdf |
CERN | 7 June | ~3h | CREAM CE | Job submission failure | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCREAMCe070610 |
CERN | 2 June | 1 day | ATLAS and LHCb online and offline databases | Database access and quality of DB services compromised | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem02June10 |
CERN | 1 June | ~2h | ATLAS offline and LCGR databases | Database services unavailability during scheduled maintenance for rolling upgrade/patching | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem31May10 |
CERN | 31 May | ~2h | CMS online | Database services unavailability during scheduled maintenance for rolling upgrade/patching | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem31May10 |
CERN | 26 May | 10 days | CMS offline database | Hw failure affecting one node, cluster running at reduced capacity | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem26May10 |
PIC | 21 May | 19 hours | Whole site | Site power cut. Outage. | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20100521_SIR_PIC_PowerCut.pdf |
CERN | 14 May | - | CASTOR | Data loss from incorrect recycling | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsAliceRecycled14May2010 |
GGUS | 12 May | <=4.5 hours | .de domain | Domain does not exist | https://twiki.cern.ch/twiki/bin/view/LCG/VoUserSupport#SIRdeDNSfailure20100512 |
CERN-ASGC | 12-15 May | - | LHCOPN | Reduced bandwidth | SIRCernAsgcLinkMay2010 |
CNAF | 28 and 29 April | 9 hours & 12 hours | STORM | SRM blockage (hardware) followed by MCDISK full and STORM bug | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-CNAF--AtlasSRMoutage-April-2010.pdf |
IN2P3 | 26 Apr | 17.5 hours | AFS | Distributed File System (AFS) crashed after server overload. Batch also affected. | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-AFSoutage-2010-04-26.pdf |
IN2P3 | 24 Apr | 17 hours | Batch | services location service stopped responding to requests blocking most batch system commands | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-BatchOutage-2010-04-24.pdf |
IN2P3 | 20 Apr | 9 hours & 5 days | Grid Downtime Notification | Grid downtime notifications were impossible after two consecutive incidents | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-OperationsPortal-2010-04-22v2.pdf |
Site | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|
CERN | 3 Mar | 18 hours | DB Replication | Replication of LHCb conditions Tier0->Tier1, Tier0->online partially down | https://twiki.cern.ch/twiki/bin/view/PDBService/StreamsPostMortem#Replication_of_LHCb_conditions_T |
IN2P3 | 15 Feb | 4.25 hours | Batch | Local worker nodes lost network connectivity | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-WNs-disconnected-2010-02-15-2.pdf |
PIC | 10 Feb | 7 hours | Spanish-CA CRLs expired at CERN | Complete blackout at CERN of services involving grid certificates (personal or host) from the Spanish CA: VOMS, FTS, etc. | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_Rediris_wLCG_formatted.pdf https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCAProxy1202210 |
CERN | 7 Feb | 4 hours | Batch Tier-0 Atlas RTT cluster | Degraded service on RunTimeTester cluster due to misconfiguration | http://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatch0702210 |
CERN | 30 Jan | 2 days | CASTORATLAS | The xroot daemon was looping on the castoratlas name server because of a bug and slowing down all normal name server calls which was causing the migrator policy to fail | https://twiki.cern.ch/twiki/bin/viewauth/CASTORService/IncidentsMigrationBacklog01Feb2010 |
RAL | 29 Jan | 5 days | CASTOR - all instances | A scheduled outage to migrate the Castor Databases back to their original disk arrays encountered significant problems resulting in an extended outage. | http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100129 |
ASGC | 18 Jan | 2 days | power system | power surge for one second and most services were restarted | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_incident_report_Jan18_2010.pdf |
GridKa/KIT | 13 Jan | 26 hours | site BDII and lcg-CE | site BDII query problems and missing lcg-CE information | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_FZK-LCG2_2010-01-13.pdf |
IN2P3 | 4 Jan | 6 hours | Batch | Local batch system database server overload | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR-IN2P3-CC-lbms-DB-overload-2010-01-04.pdf |
Site | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|
CERN | 21 Sep 2009 | 08:00 - 18:00 | DB Replication | ATLAS Replication Tier0->Tier1 down | https://twiki.cern.ch/twiki/bin/view/PDBService/StreamsPostMortem |
RAL | 15 - 17 Sept 2009 | 2 days | CASTOR | Disk to Disk (D2D) transfers started failing during a planned upgrade to the NS | http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090915 |
FZK | 7 - 16 Sep 2009 | 10 days | ATLAS RAC | 3D Streams replication blocked then degraded | https://twiki.cern.ch/twiki/bin/viewfile/LCG/WLCGServiceIncidents?rev=1;filename=SIR-FZK-20090907.pdf |
CERN | 5 & 8 Sept 2009 | 2 * 2 hours | CASTOR LHCb | two Castor Database problems | https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090905 |
CERN | 26 Aug 2009 | 18:40 - 23:30 | Batch | Public and production queues closed | https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090826 |
ASGC | 17 Jul 2009 | 6:00 - 10:00 | Power cut | Most services went down and restarted | https://twiki.cern.ch/twiki/bin/viewfile/LCG/WLCGServiceIncidents?rev=1;filename=power_cut_ASGC.txt |
ATLAS | 13 Jul 2009 | 10:00 - 11:00 | Central Catalogs | Performance degradation | PostMortem13Jul09 |
Site | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|
RAL | 29 Mar 09 | 33 hours | complete site | down | http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090324 |
RAL | 09 Mar 09 | 24 hours | DNS | all, especially SRM | http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20090309 |
CERN | 04 Mar 09 | 3 hours | CASTOR | down | https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090304 |
ASGC | 25 Feb 09 | days to weeks | many | down | Fire in UPS. Partial report on Tuesday in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek090406 |
Attachment | History | Action | Size | Date | Who | Comment |
---|---|---|---|---|---|---|
2009-11-26_CMS_CCIN2P3_Report.pdf | r1 | manage | 78.4 K | 2009-12-01 - 15:52 | DirkDuellmann | CMS Data Loss Incident at FR-CCIN2P3 |
20090411_SIR_SRM_PIC.pdf | r1 | manage | 152.9 K | 2009-04-16 - 15:10 | OlofBarring | |
20091219_PIC_Service_Incident_Report.pdf | r1 | manage | 23.3 K | 2009-12-23 - 11:20 | GonzaloMerino | SIR of the cooling incident at PIC on 19 Dec 2009 |
20100521_SIR_PIC_PowerCut.pdf | r1 | manage | 123.4 K | 2010-05-31 - 09:39 | GonzaloMerino | SIR for the power cut affecting PIC Tier1 on 21-22 May 2010 |
20100727_SIR_PIC_CoolingModule.pdf | r1 | manage | 75.0 K | 2010-07-27 - 18:14 | GonzaloMerino | Cooling problem at PIC WN module causing about 50% of WNs to be shut down (running jobs killed) |
20100727_SIR_PIC_DDN.pdf | r1 | manage | 63.4 K | 2010-07-27 - 17:25 | GonzaloMerino | SRM ATLAS problems at PIC on 22-Jul due to wrong dCache configuration. About 10h. |
20100727_SIR_PIC_Gridmapdir.pdf | r1 | manage | 48.0 K | 2010-07-27 - 17:23 | GonzaloMerino | CE failure at PIC of 3hrs on 20-Jul due to a faulty gridmapdir migration. |
20100831_SIR_ASGC_STAGERFTS_DB.pdf | r2 r1 | manage | 246.7 K | 2010-09-13 - 18:26 | JhenWeiHuang | 20100831_SIR_ASGC_STAGERFTS_DB.pdf |
20100924_SIR_ASGC_DCPOWERCUT.pdf | r1 | manage | 249.2 K | 2010-10-16 - 22:40 | JhenWeiHuang | 20100924_SIR_ASGC_DCPOWERCUT |
20110211_SIR_PIC_ATLAS_lost_files.pdf | r1 | manage | 45.7 K | 2011-02-11 - 13:29 | GonzaloMerino | Incident with ATLAS lost files at PIC 21/1/2011 |
20110310_SIR_PIC_ATLAS_lost_Files.pdf | r1 | manage | 74.2 K | 2011-03-10 - 13:44 | GonzaloMerino | Update to the PIC SIR of lost files with ATLAS (21-Jan-2011 to 8-Feb-2011) |
20110501_SIR_ASGC_10GbLINKDOWN.pdf | r1 | manage | 187.2 K | 2011-05-11 - 19:03 | JhenWeiHuang | 20110501_SIR_ASGC_10GbLINKDOWN.pdf |
20110521_SIR_ASGC_DCPOWERCUT.pdf | r1 | manage | 114.7 K | 2011-05-27 - 07:56 | JhenWeiHuang | SIR for ASGC DC Power Cut on 21 May 2011 |
20110727GGUS_Service_Incident_Report.pdf | r1 | manage | 51.0 K | 2011-07-27 - 12:00 | DirkDuellmann | |
20120122SIRPowerandCoolingProblematPIC.pdf | r1 | manage | 313.2 K | 2012-02-03 - 14:50 | GonzaloMerino | SIR of the power and cooling incident at PIC Jan 22nd 2012 |
20120310SIRATLASlostfileonLTO5tapeG05918.doc | r1 | manage | 34.0 K | 2012-04-13 - 21:16 | AlexeySedov | ATLAS Tape incident at PIC |
20120603_SIR_Cooling_Incident_at_PIC.pdf | r1 | manage | 54.4 K | 2012-06-04 - 23:58 | GonzaloMerino | Cooling incident at PIC on 3-Jun-2012: Computing service degraded |
20120621_SIR_Cooling_Incident_at_PIC.pdf | r1 | manage | 86.0 K | 2012-06-28 - 11:57 | GonzaloMerino | Cooling incident at PIC 21-Jun-2012: 17% of WNs switched off |
20120729_SIR_ASGC_STAGERDB.pdf | r1 | manage | 293.2 K | 2012-11-21 - 18:55 | JhenWeiHuang | 20120729_SIR_ASGC_STAGERDB |
20121009_PIC_SIR_ATLAS_deleted_files.pdf | r1 | manage | 61.3 K | 2012-10-26 - 13:40 | GonzaloMerino | SIR for accidental ATLAS files deletion at PIC |
20121210SIRPICLHCblostfilesontape2.pdf | r1 | manage | 58.7 K | 2012-12-14 - 15:54 | GonzaloMerino | SIR for the lost LHCb tape files at PIC on Dec 2012 |
20170321_SIR_CERN_PHEDEX.pdf | r1 | manage | 49.0 K | 2017-04-03 - 13:24 | KateDziedziniewicz | CMS Phedex not working at CNAF/WISCONSIN after CMSR migration |
3D-DB-incident-20100629.pdf | r1 | manage | 43.4 K | 2010-06-30 - 20:22 | FelixLee | ASGC 3D DB incident report 20100629 |
ASGC-DB-Sep28.pdf | r1 | manage | 22.5 K | 2009-10-12 - 17:05 | JamieShiers | |
ASGC-SIR20130324-Atlas_file_lost.pdf | r2 r1 | manage | 31.4 K | 2013-05-08 - 00:40 | FelixLee | ASGC file loss to Atlas MCTAPE |
ASGC_DATA_LOSS_SIR-NOV_2013.pdf | r1 | manage | 36.5 K | 2013-11-21 - 14:41 | FelixLee | ASGC_DATA_LOSS_SIR-NOV2013 |
ASGC_SIR_2012-04-11.pdf | r1 | manage | 281.0 K | 2012-05-03 - 10:35 | JhenWeiHuang | ASGC_SIR_2012-04-11.pdf |
ASGC_incident_report_Jan18_2010.pdf | r2 r1 | manage | 16.6 K | 2010-02-02 - 02:56 | HorngLiangShih | |
CCIN2P3-WLCGT1SCM-LHCB-SW-Problem-Report-20101111.pdf | r1 | manage | 78.3 K | 2011-01-18 - 10:40 | JamieShiers | CCIN2P3 Shared s/w area interim report |
CERN_OCSP_incident_report.pdf | r1 | manage | 46.9 K | 2020-06-29 - 21:40 | MaartenLitmaath | CERN Grid CA OCSP incident report, June 24-25, 2020 |
Fibre_Cut_June_2009.pdf | r1 | manage | 177.9 K | 2009-07-06 - 08:30 | JamieShiers | |
GOCDB_Outage_5th_March_2014.doc | r1 | manage | 30.5 K | 2014-03-13 - 15:52 | MaartenLitmaath | GOCDB Outage 5th March 2014 |
GridKa_SIR_20100612.pdf | r1 | manage | 28.5 K | 2010-06-15 - 15:19 | UnknownUser | CMS dCache down for approx. 3h15 |
GridKa_SIR_20100706.pdf | r1 | manage | 34.2 K | 2010-07-07 - 23:44 | JosVanWezel | |
GridKa_SIR_PBS-Jan11.pdf | r1 | manage | 47.7 K | 2011-02-07 - 14:57 | AndreasHeiss | SIR about GridKa local batch system problems, January 2011 |
GridKa_SIR_lost_files_alice_20110526.pdf | r1 | manage | 8.2 K | 2011-06-06 - 17:23 | JosVanWezel | KIT SIR lost files ALICE 5/2011 |
GridKa_Service_Incident_Report_12082011.pdf | r1 | manage | 461.9 K | 2011-12-12 - 15:00 | XavierMol | |
KDC-SIR.pdf | r2 r1 | manage | 66.1 K | 2011-08-23 - 14:52 | DirkDuellmann | |
KIT_SIR_CMSChimeraDatabase_2018-08.pdf | r1 | manage | 196.1 K | 2018-08-20 - 10:15 | XavierMol | Database incident CMS dCache Aug 2018 |
KIT_SIR_OnlineStorage_2022-03.pdf | r1 | manage | 167.3 K | 2022-08-12 - 10:33 | XavierMol | SE outage due to network intervention |
KIT_SIR_StorageFTS_20121127.pdf | r1 | manage | 298.0 K | 2013-01-22 - 16:01 | XavierMol | SIR about offline FTS and dCache pool nodes end of Nov 2012 at GridKa. |
KIT_SIR_Storage_20131028.pdf | r2 r1 | manage | 429.2 K | 2014-04-08 - 08:29 | XavierMol | 130 files lost for CMS |
KIT_SIR_Storage_20141023.pdf | r1 | manage | 203.8 K | 2014-10-31 - 13:06 | ThomasHartmann | KIT: SIR: identification of file losses from tape due to wrong end-of-tape markers |
KIT_SIR_TapeStorage_2017-12.pdf | r1 | manage | 195.6 K | 2018-03-13 - 08:57 | XavierMol | SIR KIT Tape Storage Q4 2017 |
LHCb_Databases_Upgrade_Migration_Incident_report.pdf | r1 | manage | 43.1 K | 2018-03-21 - 18:27 | IgnacioCoterillo | |
POSTmortem-CMS-Oct2010.docx | r1 | manage | 117.8 K | 2010-10-15 - 13:51 | MaartenLitmaath | CMS storage down at CNAF Oct 6-10, 2010 |
PostMortemTier-1ServiceIncidentRAIDCORRUPTIONAdaptec644515-03-2012.doc | r1 | manage | 52.5 K | 2012-04-13 - 21:13 | AlexeySedov | ATLAS Data Loss Incident at PIC |
Post_Mortem_PIC_Tier-1_SIR_Computing_SSC5_20110525.pdf | r1 | manage | 94.9 K | 2011-06-01 - 16:18 | UnknownUser | SIR for the computing incident at PIC on 25/26th May 2011 |
Post_Mortem_Tier-1_Service_Incident_dCache_PNFS_overload_10-June-2011.pdf | r1 | manage | 129.5 K | 2011-06-14 - 17:14 | UnknownUser | |
Post_Mortem_Tier-1_Service_Incident_dCache_PNFS_overload_10-June-2011_f.pdf | r1 | manage | 129.9 K | 2011-06-16 - 15:53 | UnknownUser | |
Post_mortem_LFC_indicent_23-26_May_2009_-_WikiPIC.pdf | r1 | manage | 163.7 K | 2009-05-27 - 17:28 | JamieShiers | |
SIR-2018-CCIN2P3-DiskServerFailure.pdf | r1 | manage | 416.0 K | 2018-10-05 - 16:26 | EricFede | SIR for CCIN2P3 Data lost on xrootd storage |
SIR-ALICE-KIT-overload-v2.pdf | r1 | manage | 78.8 K | 2014-05-07 - 18:52 | MaartenLitmaath | SIR about KIT firewall and OPN overload by ALICE jobs |
SIR-CNAF--AtlasSRMoutage-April-2010.pdf | r1 | manage | 112.5 K | 2010-05-10 - 14:22 | HarryRenshall | CNAF ATLAS SRM blockage 28 April then MCDISK full STORM bug |
SIR-FZK-20090907.pdf | r1 | manage | 74.9 K | 2009-09-29 - 14:42 | HarryRenshall | SIR of FZK degraded ATLAS RAC 7 to 16 Sep 2009 |
SIR-IN2P3-CC-AFSoutage-2010-04-26.pdf | r1 | manage | 12.0 K | 2010-05-07 - 11:14 | HarryRenshall | SIR for IN2P3 AFS Outage |
SIR-IN2P3-CC-BatchOutage-2010-04-24.pdf | r1 | manage | 15.4 K | 2010-05-04 - 09:48 | HarryRenshall | SIR of IN2P3 batch outage of 24/25 April 2010 |
SIR-IN2P3-CC-CVMFS-2012-07-03-v0.pdf | r1 | manage | 6.9 K | 2012-07-18 - 23:06 | MaartenLitmaath | IN2P3-CC CVMFS inconsistency |
SIR-IN2P3-CC-CVMFSSquid-2012-06-24-v2.pdf | r1 | manage | 8.7 K | 2012-08-29 - 22:17 | MaartenLitmaath | software area unavailable at IN2P3 on 24-Jun-2012 |
SIR-IN2P3-CC-LHCb-AFS-Latency-2010-S2-v2.pdf | r1 | manage | 212.3 K | 2011-02-14 - 22:14 | MaartenLitmaath | Slow AFS response causing environment setup timeout for LHCb jobs |
SIR-IN2P3-CC-Network-2011-02-13-v0.pdf | r1 | manage | 6.8 K | 2011-03-01 - 15:45 | MaartenLitmaath | IN2P3-CC core network switch outage due to CPU card failure |
SIR-IN2P3-CC-Network-2011-03-14-v1.pdf | r1 | manage | 6.2 K | 2011-03-25 - 16:07 | MaartenLitmaath | IN2P3-CC hardware failure on network equipment |
SIR-IN2P3-CC-OperationsPortal-2010-04-22v2.pdf | r2 r1 | manage | 17.2 K | 2010-05-07 - 11:14 | HarryRenshall | SIR for IN2P3 Downtimes Notification Impossible |
SIR-IN2P3-CC-PowerIncident-2011-04-08-v0.pdf | r1 | manage | 8.1 K | 2011-04-14 - 11:29 | MaartenLitmaath | IN2P3-CC power incident Apr 8 |
SIR-IN2P3-CC-PowerIncident-2011-08-26-v2.pdf | r1 | manage | 24.3 K | 2011-09-14 - 20:50 | MaartenLitmaath | IN2P3-CC cooling system failure Aug 26 |
SIR-IN2P3-CC-WNs-disconnected-2010-02-15-2.pdf | r1 | manage | 10.5 K | 2010-02-25 - 14:28 | HarryRenshall | Worker node network connectivity loss at IN2P3 15 Feb 2010 |
SIR-IN2P3-CC-dCache-2012-07-01-v1.pdf | r1 | manage | 6.7 K | 2012-07-18 - 22:59 | MaartenLitmaath | IN2P3-CC dCache downtime due to leap second |
SIR-IN2P3-CC-lbms-DB-overload-2010-01-04.pdf | r1 | manage | 30.1 K | 2010-01-11 - 16:08 | DirkDuellmann | IN2P3 Local batch system database server overload |
SIR-IN2P3-CC-network-2012-06-29-v0.pdf | r1 | manage | 5.7 K | 2012-07-16 - 20:04 | MaartenLitmaath | IN2P3-CC network outage |
SIR-IN2P3-CC-network-2014-11-26-v0.pdf | r1 | manage | 31.6 K | 2014-12-01 - 10:00 | AndreaSciaba | |
SIR-IN2P3-CC-network-2015-11-03-v3.pdf | r1 | manage | 33.1 K | 2015-11-12 - 14:18 | AndreaSciaba | |
SIR-IN2P3-Dcache-ATLAS-Transfer-Degradation-2010-Q4-v3.pdf | r1 | manage | 281.6 K | 2011-02-11 - 19:27 | MaartenLitmaath | IN2P3-CC dCache transfer degradation for ATLAS |
SIR20120921.pdf | r1 | manage | 31.9 K | 2012-10-16 - 18:31 | MaartenLitmaath | CNAF LHCb SE 6d downtime |
SIR_201705.pdf | r1 | manage | 127.2 K | 2017-06-06 - 12:11 | MaartenLitmaath | GGUS outage of 2017-05-31 |
SIR_ASGC_July_2012.pdf | r1 | manage | 292.8 K | 2012-11-21 - 18:42 | JhenWeiHuang | SIR_ASGC_July_2012 |
SIR_BNL_CONDB.pdf | r1 | manage | 58.3 K | 2011-09-29 - 15:12 | MariaGirone | |
SIR_BNL_DB_CFG.pdf | r2 r1 | manage | 50.6 K | 2011-09-20 - 10:01 | MariaGirone | |
SIR_CCIN2P3_15aug2011.pdf | r1 | manage | 32.8 K | 2011-08-22 - 17:12 | JamieShiers | |
SIR_CCIN2P3_19july2011.pdf | r1 | manage | 37.0 K | 2011-08-01 - 15:53 | MaartenLitmaath | IN2P3-CC database incidents due to disk drive failures |
SIR_CCIN2P3_SRM_incident_08oct2009.doc | r1 | manage | 71.5 K | 2009-10-12 - 14:22 | JamieShiers | |
SIR_CCIN2P3_cooling_outage_03nov2009.doc | r1 | manage | 12.5 K | 2009-11-06 - 17:37 | DirkDuellmann | IN2P3 cooling outage Nov 3rd |
SIR_CNAF_20190829.pdf | r1 | manage | 49.9 K | 2019-08-29 - 18:42 | MaartenLitmaath | CNAF site outage Aug 6-21, 2019 |
SIR_COOLING_OUTAGE_2009_05_03.pdf | r1 | manage | 26.7 K | 2009-05-22 - 14:05 | HarryRenshall | SIR for PIC cooling failure of 14 May 2009 |
SIR_FZK-LCG2_2010-01-13.pdf | r1 | manage | 28.5 K | 2010-01-15 - 12:58 | UnknownUser | SIR FZK-LCG2 (GridKa/KIT) - Information system problems on 13th and 14th of January 2010 |
SIR_GRID-FTP_OUTAGE_2009_06_11-1.pdf | r1 | manage | 73.9 K | 2009-06-16 - 11:06 | JamieShiers | |
SIR_PIC_ATLAS_T10KD_20160519.pdf | r1 | manage | 24.3 K | 2016-05-19 - 10:05 | AreshVedaee | T10KD issue at PIC affecting ATLAS |
SIR_PIC_COOLING_OUTAGE_2009_04_14.pdf | r1 | manage | 32.0 K | 2009-05-22 - 14:21 | HarryRenshall | SIR for PIC cooling failure of 2009.05.14 |
SIR_PIC_COOLING_OUTAGE_2009_05_14.pdf | r1 | manage | 32.0 K | 2009-05-22 - 14:26 | HarryRenshall | SIR for PIC Cooling Outage of 14 May 2009 |
SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_22.pdf | r1 | manage | 22.8 K | 2009-04-25 - 10:06 | DirkDuellmann | |
SIR_ROBOTIC_LIBRARY_OUTAGE_2009_04_26-3.pdf | r1 | manage | 17.6 K | 2009-04-30 - 11:50 | JamieShiers | |
SIR_SARA_TAPEBACKEND_OUTAGE_2009_05_04.pdf | r1 | manage | 22.0 K | 2009-05-07 - 15:27 | HarryRenshall | SIR for SARA Tapebackend outage 4 to 6 May 2009 |
SIR_cooling_failure_20100710.pdf | r1 | manage | 53.4 K | 2010-07-19 - 14:28 | UnknownUser | SIR of the cooling incident at KIT on July 10 |
SIR_storage_FZK_GridKa.pdf | r1 | manage | 51.7 K | 2009-07-02 - 14:17 | JamieShiers | |
SIRondatalossinASGCinOct.2016.pdf | r1 | manage | 32.1 K | 2016-11-11 - 14:21 | MaartenLitmaath | ASGC - loss of ATLAS data, 18 Oct 2016 |
SIRs-by-Q-2012.xlsb | r1 | manage | 43.8 K | 2012-11-23 - 14:06 | JamieShiers | Spreadsheet for producing SIR plots for WLCG QRs |
SURFsara_SIR_network_outage_30-6-2016.pdf | r1 | manage | 57.0 K | 2016-07-13 - 14:36 | UnknownUser | |
SURFsara_Service_Incident_Report_-_bw32-1_backplane.pdf | r1 | manage | 4267.2 K | 2015-02-09 - 16:58 | AndreaSciaba | |
Service_Incident_Report.pdf | r1 | manage | 177.2 K | 2014-01-14 - 12:09 | SimoneCampana | Service instabilities in the SURFsara grid storage cluster |
Service_Incident_Report_for_BNL_Tier1-06-2013.pdf | r1 | manage | 28.3 K | 2013-06-26 - 21:56 | MichaelErnst | Service Incident Report for US ATLAS Tier-1 Center |
Storage_incident_report_at_TRIUMF_Sep-16-2013.pdf | r1 | manage | 46.6 K | 2013-09-25 - 00:46 | RedaTafirout | TRIUMF incident report (lost files) |
TRIUMF-dcs08lun0_incident_20161218.pdf | r1 | manage | 41.7 K | 2017-01-25 - 18:05 | DiQing | ATLAS lost files at TRIUMF due to hardware/firmware issue on December 18 2016 |
TRIUMF-incident-report-april10-2012.pdf | r1 | manage | 29.8 K | 2012-04-27 - 02:36 | RedaTafirout | TRIUMF incident report |
WLCG_AuthZ_Meeting_-_ATLAS_IAM_Outage_(31:10:2022)_-_CodiMD.pdf | r1 | manage | 354.6 K | 2022-11-28 - 12:25 | HannahShort | |
post-mortem-CNAF-CE-Problem-Sept-2016.pdf | r1 | manage | 141.2 K | 2016-10-17 - 20:22 | MaartenLitmaath | |
power_cut_ASGC.txt | r1 | manage | 0.6 K | 2009-07-31 - 16:19 | GangQin | power cut at ASGC on July 17th |
power_surge_ASGC_20090118.txt | r1 | manage | 0.8 K | 2010-02-01 - 12:59 | GangQin | Po |
sir-in2p3-cc-dcachesrmincident-2011-03-19-v2.pdf | r1 | manage | 7.1 K | 2011-03-28 - 14:08 | MaartenLitmaath | IN2P3-CC dCache SRM overload |
sir-in2p3-cc-powerincident-2011-02-25-v0.pdf | r1 | manage | 7.8 K | 2011-03-07 - 19:18 | MaartenLitmaath | IN2P3-CC power incident Feb 25 |
sir-kit-atlas-dcache-20110728.pdf | r1 | manage | 25.9 K | 2011-07-28 - 14:18 | AndreasPetzold | SIR ATLAS dCache data loss at KIT July 2011 |
sir_BatchIncident_15_10_09.pdf | r1 | manage | 29.9 K | 2009-10-15 - 16:07 | JamieShiers | |
sir_in2p3network_outage_10_12_2009.pdf | r1 | manage | 48.8 K | 2009-12-14 - 10:01 | HarryRenshall | SIR of IN2P3 DNS Load Balancing Failure 8 December 2009 |
uscmsT1_SIR_042015.pdf | r2 r1 | manage | 46.7 K | 2015-05-04 - 15:00 | LucaMascetti | 2015-05 FNAL uscms lost files |