Site | Service Area | Date | Duration | Service | Impact | Report |
---|---|---|---|---|---|---|
PIC | CE | 21 Jun | 1 h | PIC Tier1 Computing | About 17% of the WN capacity switched off due to cooling incident | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120621_SIR_Cooling_Incident_at_PIC.pdf |
CERN | Storage | 18 Jun | ~1 h | CASTOR | c2atlas diskservers were not reachable for ~1 h | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsRmNodeMisconfiguration20120618 |
CERN | Storage | 5 Jun | 1 h | CASTOR | communication problems and client timeouts | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsNameServerContention20120605 |
PIC | CE | 3-4 Jun | 18 h | PIC Tier1 Computing | 18 h of service degradation: number of cores reduced by 60% due to cooling incident | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/20120603_SIR_Cooling_Incident_at_PIC.pdf |
CERN | DB | 22 May | 1.5 h | CMS online DB | 1.5 hours of high luminosity data lost | https://twiki.cern.ch/twiki/bin/view/DB/PostMortem22May12 |
CERN | Storage | 22 May | 5-40 min | CASTOR | ~1k unavailable files after transparent DB intervention | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsDegradationDBIntervention20120522 |
CERN | Infrastructure | 19-20 April | 1 day | batch | batch system down | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchDown190412 |
CERN | Infrastructure | 18-20 April | 2 days | batch | ATLAS Tier-0 job submission system could not keep up with incoming RAW data | https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentBatchSlow180412 |
ASGC | Storage | 11-12 April | 24 h | CASTOR | hardware failure, DB crashed | https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/ASGC_SIR_2012-04-11.pdf |
TRIUMF | All Tier-1 services | 10-11 April | 20 h | All Tier-1 services | Two site-wide power failures | https://twiki.cern.ch/twiki/pub/LCG/TempArea/TRIUMF-incident-report-april10-2012.pdf |
CERN | Storage | 4 April | 1.5 h | CASTOR | Name Server stuck, 3 CMS files had to be rewritten | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsCentralNSStuck20120404 |
CERN | Storage | 2 April | several days | CASTOR | Hardware issue on 1 LHCb diskserver (files unavailable; 3 file systems ultimately lost) | https://twiki.cern.ch/twiki/bin/view/CASTORService/IncidentsDiskOnlyDataLoss20120402 |