Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
ID-ITB | 63/63 | No Downtime declared - DCP |
JP-HIROSHIMA-WLCG | 65/65 | No Downtime declared - DCP |
MY-UM-CRYSTAL | 71/71 | random failures across the month - DCP |
MY-UPM-BIRUNI-01 | 24/24 | No Downtime declared - DCP |
MY-UTM-GRID | 0/0 |
TW-NCUHEP | 61/61 | No Downtime declared - DCP |
VN-HPCC-HUT-HN | 2/2 | Unscheduled downtime of 7 days to check network connection - DCP |
VN-IFI-PPS | 38/38 | Unscheduled downtime of 6 days to lay underground internet lines at Vinaren - DCP
CentralEurope | ||
BY-JIPNR-SOSNY | 69/69 | Network problems - DCP |
FMPhI-UNIBA | 73/73 | unscheduled downtime during 9 days due to site configuration problems - DCP |
GermanySwitzerland | ||
GSI-LCG2 | 73/73 | Unscheduled downtime of 8 days to move nodes and reconfigure services - DCP |
GoeGrid | 74/74 | Unscheduled downtime of 1 hour due to power outage, name server problems |
TUDresden-ZIH | 68/68 | Unscheduled downtime of 1 day and 18 hours due to unknown problem with dCache |
UNI-FREIBURG | 73/73 | random failures across the month - DCP |
UNI-SIEGEN-HEP | 73/73 | No Downtime declared - DCP |
wuppertalprod | 60/60 | site failures during 12 days without downtime declared - DCP |
Italy | ||
INFN-BARI | 52/52 | The problems observed in the last two weeks were mainly due to two different issues: 1) a couple of disk arrays were lost due to multiple concurrent disk failures, which caused failures of the SAM SRM-related tests; 2) a problem with the local batch system server, which we solved by switching to a new machine. We assume that both problems are now solved, so we expect far better behaviour.
INFN-CAGLIARI | 61/71 | One host was missing the lcg-CA updates and the lcg-voms.cern.ch configuration, and almost all test jobs ended up running on it
INFN-GENOVA | 71/71 | Fibre Channel disk problem and StoRM configuration problem
INFN-NAPOLI-CMS | 29/29 | unforeseen electrical interruptions |
INFN-ROMA3 | 29/56 | Site was in downtime from 6 April through 21 April (two consecutive time slots). We received and installed two new racks, and a new air-conditioning system for those racks. Installation was not as smooth as expected and there were some delays. After the long downtime we had some issues with our StoRM SE, in conjunction with FTS: we were flooded by ATLAS jobs, which repeatedly failed and triggered new file copies until a couple of space tokens filled up. At that point we concentrated on solving this problem, which implied switching services on and off several times.
INFN-TRIESTE | 62/99 | WN migration to SL5, CE update, installation of a new StoRM version, and pool account reconfiguration
NGI_PL | ||
PEARL-AMU | 0/0 | CE problem - DCP |
NorthernEurope | ||
EENet | 66/66 | intermittent failures across the month - DCP |
KTU-ELEN-LCG2 | 54/54 | No Downtime declared - DCP |
LSG-RUG | 53/53 | No Downtime declared - DCP |
T2_Estonia | 47/47 | |
UNIGE-DPNC | 69/69 | site failing for 10 days without downtime defined - DCP |
ROC_LA | ||
SAMPA | 0/0 | Site not tested during most of the month - DCP |
Russia | ||
RU-Phys-SPbSU | 2/2 |
ru-IMPB-LCG2 | 63/63 | site failing for 11 days without downtime defined - DCP |
SouthEasternEurope | ||
BG02-IM | 70/70 | intermittent failures across the month - DCP |
BG07-EDU | 0/0 | Site not tested during most of the month - DCP |
GE-01-GRENA | N.A./N.A. | Site not tested during most of the month - DCP |
IL-TAU-HEP | 64/64 | intermittent failures across the month - DCP |
MK-02-ETF | 0/0 | Site not tested during most of the month - DCP |
RO-02-NIPNE | 0/0 | Unscheduled downtimes during the whole month. Cooling, storage & config problems reported - DCP |
TR-01-ULAKBIM | 53/53 | intermittent failures across the month. No downtimes - DCP |
TR-03-METU | 48/48 | intermittent failures across the month. No downtimes - DCP |
TR-04-ERCIYES | 69/69 | intermittent failures across the month. No downtimes - DCP |
TR-05-BOUN | 63/63 | intermittent failures across the month. No downtimes - DCP |
TR-07-PAMUKKALE | 69/69 | intermittent failures across the month. No downtimes - DCP |
TR-09-ITU | 68/68 | intermittent failures across the month. No downtimes - DCP |
TR-10-ULAKBIM | 66/66 | intermittent failures across the month. No downtimes - DCP |
WEIZMANN-LCG2 | 70/70 | intermittent failures across the month due to SE & BDII problems - DCP |
SouthWesternEurope | ||
BIFI | 55/55 | intermittent failures across the month due to top-BDII & WMS problems - DCP |
DI-UMinho | 8/8 | site failing most of the month without scheduled downtime - DCP |
IEETA | 37/44 | intermittent failures across the month - DCP |
MA-01-CNRST | 74/74 | unsched maintenance for 2 days and 8 more days with failures - DCP |
RedIRIS_GILDA | 37/37 | No Downtimes declared - DCP |
UMinho-CP | 65/65 | intermittent failures across the month. Unsched electric intervention during 1 day - DCP |
UNICAN | 54/54 | No Downtimes declared - DCP |
UPV-GRyCAP | 74/74 | No Downtimes declared - DCP |
UKI | ||
UKI-LT2-RHUL | 0/N.A. | Scheduled Downtime during the whole month to upgrade cluster to SL5 - DCP |
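
A short note on the A%/R% columns used throughout these tables: the figures are derived from the SAM test results and are conventionally related (approximately; the exact handling of periods with unknown test status varies between report versions, so treat this as a sketch rather than the official definition) as

$$A \;\approx\; \frac{T_{\text{up}}}{T_{\text{total}}}, \qquad R \;\approx\; \frac{T_{\text{up}}}{T_{\text{total}} - T_{\text{scheduled downtime}}}$$

so reliability is never lower than availability, and the two coincide for sites that declared no scheduled downtime (compare, for instance, a 29/56 entry, which reflects a long scheduled downtime, with a 63/63 entry, which reflects none).
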
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
CN-BEIJING-PKU | 0/0 |
ID-ITB | 68/68 | No downtime declared - DCP |
JP-HIROSHIMA-WLCG | 50/53 | Hardware error in network switch - DCP |
MY-UM-CRYSTAL | 45/45 | No downtime declared - DCP |
MY-UPM-BIRUNI-01 | 66/66 | Network & Core switch upgrades - DCP |
MY-UTM-GRID | 0/0 | No downtime declared - DCP |
NCP-LCG2 | 66/73 | Network problem at ISP end. Config problems at DPM & WN. Testing services on new network. Power issues - DCP |
TH-HAII | 74/74 | No downtime declared - DCP |
TW-NCUHEP | 54/54 | No downtime declared - DCP |
TW-NIU-EECS-01 | 40/40 |
VN-IFI-PPS | 56/56 | No downtime declared - DCP
VN-IOIT-HN | 1/1 |
CentralEurope | ||
FMPhI-UNIBA | 14/26 | Switch to gigabit link and reorganization of services - DCP |
egee.grid.niif.hu | 48/48 | "Full environmental change"? EMI transition preparation - DCP
GermanySwitzerland | ||
GoeGrid | 16/17 | Continued upgrades, filesystem problems, SRM problems and a defective switch
UNI-FREIBURG | 20/20 | NFS server problems. Software area unavailable - DCP |
UNI-SIEGEN-HEP | 45/45 | Nagios test org.sam.CE-JobSubmit-/ops/Role=lcgadmin failed for a while (ticket 4461) but magically cured itself; reason unknown. |
wuppertalprod | 52/52 | central file system failed. New network backbone will be installed - DCP |
Italy | ||
GRISU-CYBERSAR-CAGLIARI | 57/57 | Problem on local cluster and storage |
GRISU-CYBERSAR-PORTOCONTE | 45/45 | Problem on the SE due to an expired lcg-voms certificate; a BDII set by mistake on the WNs
INFN-CAGLIARI | 63/63 | Maintenance on the CE and some problems on a few worker nodes, which we are investigating
INFN-GENOVA | 66/93 | Changed the CC cooling system and rearranged racks. Operations on the UPS
INFN-PADOVA-CMS | 64/64 | Problem related to a bug in the new StoRM version, job submission problems (CE reinstalled) and authentication problems due to too few sgmops pool accounts
INFN-PERUGIA | 14/24 | Hardware upgrade & issues on SE |
TRIGRID-INFN-CATANIA | 62/70 | UPS maintenance & Hardware problems - DCP |
NGI_PL | ||
PEARL-AMU | 29/70 | "maintenance"? - DCP |
PSNC-GILDA | 30/69 | "Initial site DOWNTIME"? - DCP |
NorthernEurope | ||
KTU-BG-GLITE | 41/41 |
ROC_Canada | ||
CA-ALBERTA-WESTGRID-T2 | 59/83 | Upgrade of power services to the main campus CC. WNs migration from SL4 to SL5 - DCP |
ROC_LA | ||
ICN-UNAM | 41/58 | Site down due to building restructuring next to the CC - DCP
SAMPA | 0/0 | Power outage due to strong rains. Troubleshooting CEs - DCP |
Russia | ||
RU-SPbSU | 73/73 | Network broken - DCP |
ru-Chernogolovka-IPCP-LCG2 | 71/71 | Hardware upgrade - DCP |
SouthEasternEurope | ||
IL-TAU-HEP | 39/50 | reformatting & installing new storage - DCP |
RO-02-NIPNE | 73/73 | cooling problems - DCP |
RO-03-UPB | 26/26 | power failure, configuration issues - DCP |
RO-15-NIPNE | 66/66 | No downtime declared - DCP |
TR-10-ULAKBIM | 48/48 | No downtime declared - DCP |
SouthWesternEurope | ||
BIFI | 27/32 | Moving to a new building - DCP |
IEETA | 46/46 | No downtime declared - DCP |
MA-01-CNRST | 0/0 | Unscheduled maintenance during the whole month - DCP |
UB-LCG2 | 32/49 | Delay in RAID controller delivery. Power cut. GIP stopped publishing info after adding queues to Torque. - DCP |
UNICAN | 64/64 | No downtime declared - DCP |
UPV-GRyCAP | 63/63 | No downtime declared - DCP |
UKI | ||
UKI-SOUTHGRID-BRIS-HEP | 65/65 | No downtime declared - DCP |
cpDIASie | 66/66 | No downtime declared - DCP |
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
CN-BEIJING-PKU | 32/32 | Down most of the month with no downtime declared - DCP |
HK-HKU-CC-01 | 63/70 | Down for 5 days with no downtime declared - DCP |
INDIACMS-TIFR | 13/13 | Unscheduled downtimes due to upgrade to glite-WN 3.2, AC & electrical maintenance, and a Squid server HDD error - DCP
MY-UM-CRYSTAL | 37/37 | Down 12 days with no downtimes declared - DCP |
PH-ATENEO | 37/37 | Down most of the month with no downtimes declared - DCP |
TH-HAII | 68/68 | Down 5 days with no downtimes declared - DCP |
VN-HPCC-HUT-HN | 50/50 | Down half of the month with no downtimes declared - DCP |
VN-IFI-PPS | 8/8 | Down most of the month with no downtimes declared - DCP |
VN-IOIT-HN | 49/49 | Down half of the month with no downtimes declared - DCP |
CentralEurope | ||
CYFRONET-LCG2 | 69/80 | Unscheduled downtime of 5 days due to power maintenance, DPM reinstall & shared areas migration - DCP |
IFJ-PAN-BG | 60/72 | Power maintenance - DCP |
PSNC-GILDA | 0/0 | Site tested during just 2 days in the month, with failures - DCP |
prague_cesnet_lcg2 | 72/72 | Intermittent failures during the month - DCP |
France | ||
MSFG-OPEN | 62/62 | Site tested during just 8 days in the month, out of which 2 had failures - DCP. A new site with initial difficulties in the production environment - RR |
GermanySwitzerland | ||
GSI-LCG2 | 68/68 | Unscheduled downtime of 9 days due to power failure in UPS & cooling maintenance - DCP
GoeGrid | 56/63 | Unscheduled downtime of 6 days for hardware upgrades and a dCache upgrade
Italy | ||
INFN-BOLOGNA | 54/73 | Unscheduled downtime of 2 days & scheduled downtimes during 6 days (added new WNs) |
INFN-NAPOLI-CMS | 57/70 | Software upgrades & power cuts |
NorthernEurope | ||
EENet | 70/70 | intermittent problems during the month - DCP |
IMCSUL | 0/0 | Site tested during just a few hours in the month, with failures - DCP |
T2_Estonia | 70/70 | intermittent problems during the month - DCP |
VU-MIF-LCG2 | 54/54 | down 9 days, not tested 9 days and up 10 days - DCP |
ROC_Canada | ||
ALBERTA-LCG2 | 0/0 | Site upgrade and migration to Tier3 continued during 20 days of sched downtime - DCP |
CA-SCINET-T2 | 67/67 | Unscheduled downtimes to reconfigure hostnames and grid-user home space on CE - DCP |
ROC_LA | ||
SAMPA | 14/22 | Power failure due to heavy rains in Sao Paulo; the cluster did not restart afterwards. Troubleshooting CE, CREAM-CE & SE - DCP
Russia | ||
ITEP | 40/40 | Unscheduled downtimes due to disk and network problems plus middleware upgrade - DCP |
ru-Chernogolovka-IPCP-LCG2 | 22/22 | No downtimes declared. Down during 22 days - DCP |
ru-Moscow-MEPHI-LCG2 | 0/0 | Unscheduled downtime during 89 days due to "General infrastructure problem" - DCP |
SouthEasternEurope | ||
GR-04-FORTH-ICS | 69/69 | Unscheduled downtime during 6 days to upgrade to SL5.3 and gLite 3.2 WNs - DCP |
HG-06-EKT | 59/88 | Upgrade of worker nodes to SL5/gLite-3.2 & migration from GPFS to Lustre - DCP |
IL-TAU-HEP | 57/57 | Problem on Lustre - DCP |
SouthWesternEurope | ||
BIFI | 68/78 | Scheduled downtime during 8 days to move to a new building - DCP |
DI-UMinho | 14/14 | Unscheduled downtimes during 11 days due to updates in CE certificate - DCP |
MA-01-CNRST | 11/11 | Unscheduled downtime of 21 days due to maintenance of the site
UB-LCG2 | 0/0 | The RAID controller for the users' home area died and had to be replaced - DCP
UKI | ||
UKI-LT2-UCL-CENTRAL | 39/65 | Scheduled downtime during 10 days for 'lfsck'-ing Lustre - DCP
UKI-SOUTHGRID-BRIS-HEP | 0/0 | Scheduled downtime during the whole month due to DPM retirement & lcgce04's WN configuration - DCP |
UKI-SOUTHGRID-RALPP | 74/74 | Unsched downtime due to problems with air conditioning - DCP |
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
IN-DAE-VECC-01 | 32/32 | Annual Maintenance of Cooling solution (unsched) - DCP |
IN-DAE-VECC-02 | 69/69 | Annual Maintenance of Cooling solution (unsched) - DCP |
INDIACMS-TIFR | 7/7 | network outage (unsched), system disk fail on DPM disk server (unsched) - DCP |
MY-MIMOS-GC-01 | 64/64 | Hardware failure (unsched) - DCP
MY-UM-CRYSTAL | 67/67 | |
NCP-LCG2 | 48/48 | CE issues for 14 days (unsched) - DCP |
PAKGRID-LCG2 | 17/17 | Internet link down during 70 days (unsched)
PH-ATENEO | 31/31 | |
TH-HAII | 57/57 | |
TW-NCUHEP | 69/69 | |
VN-HPCC-HUT-HN | 67/67 | |
CentralEurope | ||
PEARL-AMU | 74/74 | |
GermanySwitzerland | ||
SCAI | 57/57 | timeout requests to DPM for 6 days (unsched) - DCP |
Italy | ||
INFN-CATANIA | 72/72 | unexpected problem with the giga-pop connection to GARR |
INFN-FERRARA | 72/72 | power supply problem for some days |
NorthernEurope | ||
BEgrid-KULeuven | 46/46 | |
VU-MIF-LCG2 | 2/2 | Reinstall WN from SLES to SLC. Cooling problems. Whole month (unsched) - DCP |
ROC_Canada | ||
ALBERTA-LCG2 | 14/29 | Site upgrade and migration to T3 during 8 days (unsched)
CA-SCINET-T2 | 61/68 | Sched downtime to move & reconfig dCache disks & softw updates - DCP |
SDU-LCG2 | 70/70 | |
ROC_IGALC | ||
UFRJ-IF | 61/61 | Problem with DPM & powercut (unsched) - DCP |
ROC_LA | ||
UNIANDES | 66/73 | Power failure, all personnel on vacation (unsched for 8 days) - DCP
Russia | ||
ru-IMPB-LCG2 | 53/53 | |
ru-Moscow-MEPHI-LCG2 | 0/0 | Severe infrastructure problem (unsched for 25 days) - DCP |
SouthEasternEurope | ||
IL-TAU-HEP | 55/55 | |
RO-11-NIPNE | 53/53 | Internal network & hardware problems during 11 days (unsched) - DCP |
SouthWesternEurope | ||
MA-01-CNRST | 59/59 | Maintenance of the site during 21 days (unsched)
UKI | ||
UKI-LT2-UCL-CENTRAL | 72/72 | The batch scheduler (Moab) runs from Lustre and is often slow to dispatch the ops jobs; job output retrieval is also problematic (also Lustre related). The Lustre filesystem has since had a thorough check.
UKI-SOUTHGRID-OX-HEP | 58/75 | Air cond. failure, 6 days (unsched), 7 days (sched) - DCP |
UKI-SOUTHGRID-RALPP | 64/94 | Outage building power supply & reconfig dCache - 10 days (sched) - DCP |
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
ID-ITB | 64/64 | |
IN-DAE-VECC-02 | 63/63 | Site update and Cream-CE upgrade for which no scheduled downtime was declared - DCP |
INDIACMS-TIFR | 8/8 | Networking issues during the whole month for which no scheduled downtimes were declared - DCP |
MY-MIMOS-GC-01 | 52/52 | Power failures during 13 days for which no scheduled downtimes were declared - DCP |
NCP-LCG2 | 48/48 | Deploying, testing & troubleshooting new WNs and CE to SL5 during 2 weeks without scheduled downtime - DCP |
PAKGRID-LCG2 | 45/45 | Hardware problems during 21 days without scheduled downtime - DCP |
PH-ASTI-BUHAWI | 46/46 | |
PH-ATENEO | 47/47 | |
VN-HPCC-HUT-HN | 67/67 | |
VN-IFI-PPS | 51/51 | |
VN-IOIT-HN | 63/63 | |
CentralEurope | ||
BY-BNTU | 0/0 | Site created in mid-December but received no tests - DCP
France | ||
IPSL-IPGP-LCG2 | 59/97 | Scheduled downtime of 12 days for CE winter break followed by some reconfig - DCP |
M3PEC | N.A./N.A. | Existing site, but no tests received during the whole month - DCP |
GermanySwitzerland | ||
TUDresden-ZIH | 61/77 | Network and batch problems after software update on the machine during 7 days with unscheduled downtimes - DCP |
UNI-SIEGEN-HEP | 59/59 | Long debugging of a SAM test failure (CE-sft-job) which didn't indicate any apparent reason and appeared erratically. Finally fixed. (GGUS ticket 53960) - WW
NorthernEurope | ||
VGTU-TEST-gLite | N.A./N.A. | Existing site, but no tests received during the whole month - DCP |
ROC_Canada | ||
CA-SCINET-T2 | 52/54 | Main core network switch power supply died and no tests during first half of the month. New site or site name changed? - DCP |
TORONTO-LCG2 | 2/2 | Unscheduled downtime during the whole month. Migration to new Tier-2 site - DCP |
ROC_IGALC | ||
CEFET-RJ | 69/69 | Unforeseen power cut during 2 days with unscheduled downtime declared - DCP |
UFRJ-IF | 58/58 | Globus job manager problem, unexpected power cut, unscheduled maintenances, DNS problems on campus - DCP |
ROC_LA | ||
SAMPA | 41/52 | Air conditioning off, so cluster down from Dec. 28th - DCP |
UNIANDES | 44/60 | Sched site maintenance of 5 days plus unsched power failure since Dec. 28th - DCP |
Russia | ||
ru-IMPB-LCG2 | 63/63 | Several unscheduled maintenances declared - DCP |
ru-Moscow-MEPHI-LCG2 | 41/41 | Severe optical backbone damage outside the site (unsched 7 days) plus severe infrastructure problem (unsched, 11 days) - DCP |
ru-PNPI | 54/54 | Unsched downtime of 7 days for a migration to SL5 - DCP |
SouthEasternEurope | ||
RO-09-UTCN | 58/58 | Lots of random CE-sft-lcg-rm errors due to timeouts on our central BDII server. |
RO-11-NIPNE | 59/59 | Unsched downtime of 10 days due to Internal grid networking problems, updates for SL5 - DCP |
TR-03-METU | 68/68 | The site DPM server (eymir.grid.metu.edu.tr) was heavily loaded in December; it used all its memory and some swap space, and the swap usage caused connection timeouts in the DPM services. Its memory has been extended, and after the memory upgrade the site services began to work stably
SouthWesternEurope | ||
UNICAN | 73/73 | |
UKI | ||
UKI-LT2-UCL-CENTRAL | 32/69 | Sched downtime for 17 days for awaiting, testing & deploying new kernel from vendor plus unsched downtime of 4 days to reboot nodes and push out kernel across the cluster - DCP |
UKI-SOUTHGRID-OX-HEP | 72/72 | Unsched downtime for 5 days due to problem at batch system of shared cluster plus catastrophic air cond. failure - DCP |
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
CN-BEIJING-PKU | 50/50 | |
INDIACMS-TIFR | 45/46 | Network problems between TIFR and all T1 other than CERN. Investigations by DANTE/GEANT on-going. Site should have declared 24hr unscheduled, and scheduled the rest of the downtime (but didn't) - JRS |
JP-HIROSHIMA-WLCG | 66/66 | Unscheduled 14-day system migration to SLC5 -JRS |
MY-UM-CRYSTAL | 37/37 | |
PAKGRID-LCG2 | 73/73 | Hardware problems, but no judicious use of downtime declarations -JRS |
PH-ASTI-BUHAWI | 25/25 | |
TW-NCUHEP | 72/72 | |
TW-NIU-EECS-01 | 68/68 | |
VN-IFI-PPS | 70/70 | |
VN-IOIT-KEYLAB | 63/63 | |
CERN | ||
CEFET-RJ | 70/70 | City's power system went offline, plus m/w and security issue -JRS |
SDU-LCG2 | 74/74 | |
TORONTO-LCG2 | 74/74 | 14-day downtime declared which could have been "scheduled" but wasn't -JRS |
UFRJ-IF | 70/70 | Severe power failures -JRS |
CentralEurope | ||
BY-UIIP | 56/56 | Unscheduled(!) move to SL5 and gLite 3.2 with new hardware, followed by BDII and WMS problems (GGUS-54007)
PEARL-AMU | 56/56 | updated vulnerable kernel on all nodes, NFS-lock and DNS problems -JRS |
egee.grid.niif.hu | 68/68 | "The site interfacing operational problem", err what does that mean? -JRS |
France | ||
IN2P3-IRES | 72/72 | Important upgrade following a security vulnerability impacting some services, coupled with a temporary lack of site admin manpower slowing down overall reactivity
GermanySwitz. | ||
BMRZ-FRANKFURT | 0/0 | OK, new site which appeared at end of the month |
GSI-LCG2 | 26/26 | Unscheduled(!) 10-day reconfiguration of some services -JRS |
SWITCH | 68/68 | We had a frequent and severe time-skew problem on our VMware machines. We recently set up a cron job that synchronises the time every 10 seconds, and since then we have been consistently passing the SAM tests (a sketch of such a periodic sync appears after this table). Furthermore, we reinstalled our WNs several times as we are now participating in the ARGUS deployment pilot (coordinated by CERN)
TUDresden-ZIH | 43/46 | Problems after upgrade of dCache -JRS |
Italy | ||
GRISU-COMETA-ING-MESSINA | 54/72 | Departmental internet link failure (unscheduled downtime from 5th to 6th Nov + scheduled downtime from 7th to 14th Nov), recurrent (twice daily, on average) SRMv2 failures caused by a Gridmap bug, as per GGUS 52668
GRISU-CYBERSAR-CAGLIARI | 62/62 | Network configuration problems not covered by a declared downtime - JRS |
INFN-CATANIA | 74/74 | |
INFN-FRASCATI | 56/76 | Site maintenance and storage system failure (unscheduled + scheduled downtime from 16 Nov to 27 Nov). In November we scheduled a long downtime (4 days) in order to perform many operations (among which moving the farm to a new VLAN, installing some new FC switches, moving WNs from SL4 to SL5, etc.), but we met many hardware and software problems and had to extend the downtime. When we eventually finished the work (after 4.5 days) we rebooted all the machines, and one HP blade chassis had a hardware failure. This happened on Friday afternoon, after 5 PM, so HP support was already unavailable and we couldn't call them before Monday morning (we then declared an unscheduled downtime for 4 days). Moreover, this failure was difficult to detect even for HP technicians, so we spent 2 days investigating and exchanging log files with them. The conclusion was that there was a failure on the blade chassis affecting the blade servers, so we had to wait 2 additional days for spare parts. Two of these blade servers were the DPM storage element and one of the DPM disk pools, so it was not possible to end the downtime before the failure was resolved. We have just bought a new storage element (a single-unit server, not part of a blade), so we do not expect this problem again.
INFN-NAPOLI | 50/50 | site maintenance and software upgrade |
INFN-NAPOLI-ARGO | 70/70 | Problems in storage server |
INFN-NAPOLI-PAMELA | 34/57 | hardware maintenance (unscheduled + scheduled downtime from 5th to 24th Nov)
SNS-PISA | 61/61 | The reason for the low availability of the SNS site this month is some new configuration and a hardware update. I have just put in place - and probably just stabilised - a configuration with a strict node-to-queue/VO mapping: users belonging to a specific VO can only use a set of reserved nodes, without any kind of fair share over them. This is required by a particular, and probably temporary, requirement of some internal users. The testbed should be OK at the moment, so I hope the site availability will increase in the next few days.
LatinAmerica | ||
SAMPA | 3/15 | Scheduled migration from gLite 3.1 to 3.2, plus major power problems in Brazil -JRS |
Russia | ||
RU-Novosibirsk-BINP | 5/7 | |
RU-Phys-SPbSU | 0/0 | DomainName registration(delegation) problem -JRS |
ru-IMPB-LCG2 | 62/82 | Scheduled maintenance -JRS |
SouthEasternEurope | ||
BG-INRNE | 74/74 | |
BG02-IM | 67/67 | Unscheduled 21 days of hardware tests and repairs -JRS |
IL-TAU-HEP | 43/45 | Storage problems following building move -JRS |
MK-01-UKIM_II | 69/69 | Various file-system problems plus node migrations which were not scheduled -JRS |
RO-15-NIPNE | 73/73 | |
TECHNION-HEP | 74/74 | Storage system problems -JRS |
TR-03-METU | 50/55 | A scheduled downtime for migrating all WNs to SL5.3 and gLite 3.2.
TR-09-ITU | 58/58 | The site had two different problems in November. One was a hardware problem detected on the computing element, which has been fixed. The other was an air conditioner problem in the ITU system room.
TR-10-ULAKBIM | 74/74 | Unscheduled downtime because of changes to the electrical infrastructure of the system room
SouthWestEurope | ||
BIFI | 32/32 | Network outage at the university, misconfigured authentication in CE and WMS -JRS
LIP-Coimbra | 66/73 | Scheduled building power cut, plus problems with Lustre 1.8.1.1 upgrade -JRS |
UK/I | ||
UKI-LT2-UCL-CENTRAL | 53/63 | Proper use of downtimes! Scheduled m/w upgrade, plus emergency work on air conditioning ducts -JRS |
UKI-SOUTHGRID-BHAM-HEP | 67/67 |
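
The SWITCH entry above mentions curing VM clock skew with a cron-driven job that re-synchronises the time every 10 seconds. The report does not say how this was implemented; below is a minimal sketch of one way to do it, assuming the standard `ntpdate` client is available, the script runs as root, and `pool.ntp.org` is an acceptable time source (the server name and the long-running loop, rather than a per-minute cron entry, are assumptions, since cron itself cannot fire more often than once a minute).

```python
#!/usr/bin/env python
"""Minimal sketch of a periodic clock re-sync for VM guests with time skew.

Assumptions (not taken from the report): the ntpdate client is installed,
the script runs as root, and NTP_SERVER is a reachable time source.
"""
import subprocess
import time

NTP_SERVER = "pool.ntp.org"   # placeholder time source, not from the report
INTERVAL_SECONDS = 10         # the SWITCH report mentions a 10-second interval


def sync_once(server):
    """Step the clock once with ntpdate; return True on success."""
    result = subprocess.run(["ntpdate", "-u", server],
                            capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    while True:
        if not sync_once(NTP_SERVER):
            print("time sync failed, will retry in", INTERVAL_SECONDS, "s")
        time.sleep(INTERVAL_SECONDS)
```

In practice such a loop would be started from an init script or a cron @reboot entry; a less intrusive alternative is simply to run ntpd or the VMware Tools time synchronisation on the guests.
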
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
INDIACMS-TIFR | 60/60 | network problems causing timeouts of transfers and SAM tests - JRS |
JP-KEK-CRC-01 | 57/57 | 13 day unscheduled downtime for "Removal work for obsolete tags glite 3.0" - JRS |
MY-UM-CRYSTAL | 36/36 | Poor quality of external networking at campus. Tracking intermittent outage issue with site admin |
MY-UPM-BIRUNI-01 | 66/67 | 9h scheduled downtime for "Power maintenance", but rest is unexplained -JRS |
PH-ASTI-LIKNAYAN | 49/49 | OK, site appeared during the month (18 Oct) - JRS |
TH-NECTEC-LSR | 69/69 | hostcert of CE expired early in Oct during a holiday break |
TW-NCUHEP | 18/18 | 4 days of unscheduled downtime for a core grid services and computing farm upgrade. Worked with the site admin; the majority of site services had been recovered by end of Oct.
TW-NTCU-HPC-01 | 71/71 | SE host crashed (1 day outage), plus 1d17h unscheduled power maintenance - JRS |
VN-IOIT-KEYLAB | 46/46 | DNS failures; the site admin has limited effort available to recover from the problem and is still waiting for confirmation from local NOC people
CERN | ||
CEFET-RJ | 57/57 | OK, site appeared during the month (20 Oct) - JRS |
SDU-LCG2 | 42/42 | Change vulnerable WNs, APEL & CE-sft-job failures, problems with the service CE-sft-lcg-rm -JRS |
TORONTO-LCG2 | 44/44 | One disk server was lost and not recovered for two weeks
CentralEurope | ||
BY-UIIP | 19/19 | Lots of unscheduled downtimes: Problems with accounting and SE, network problems, h/w upgrade, moving to SL5 and glite3.2 (unscheduled!) -JRS |
FMPhI-UNIBA | 40/40 | CE & LRMS problem, plus a large unscheduled downtime due to an emergency security shutdown -JRS
TASK | 67/68 | unscheduled network reconfiguration and CE backup restoration -JRS |
GermanySwitz. | ||
GSI-LCG2 | 69/69 | unscheduled power-cut in main computing centre, plus service reconfiguration |
GoeGrid | 71/73 | Hardware defect of network interface, plus scheduled maintenance |
SWITCH | 52/52 | We indeed had various problems in September which we could only address a month ago. However, the other sites did NOT update in a timely manner to the June IGTF release, which contained the new QuoVadis Grid CA, resulting in our site not being trusted for two months or so.
TUDresden-ZIH | 74/74 | There were several reasons. One was a sudden power failure on the university campus. We also had some network problems which made the SAM tests fail from time to time (without any known reason)
UNI-SIEGEN-HEP | 69/69 | Debugging CE problems with the main admin absent -JRS
Italy | ||
GRISU-CYBERSAR-PORTOCONTE | 74/74 | Site back in production on 15th Oct after having changed domain and renamed the nodes
GRISU-SPACI-NAPOLI | 72/74 | scheduled m/w update plus network problem |
INFN-CATANIA | 73/73 | unscheduled and difficult upgrade of DPM database schema |
INFN-NAPOLI | 69/69 | Network problem (switch serving WNs broken) + queue misconfiguration (no slots reserved for certification jobs)
INFN-TORINO | 57/86 | scheduled downtime for site upgrade |
SISSA-Trieste | 72/72 | lcg-cr random errors. Investigations in progress |
NorthernEurope | ||
BEgrid-KULeuven | 64/64 | Bitten by BDII upgrade problem (6 days) -JRS |
ITPA-LCG2 | 70/70 | expiration of the host certificates |
NDGF-T1 | 72/72 | Upgrade of dCache version & subsequent SRM and SAM test problems, TSM/dCache configuration problems -JRS |
NO-NORGRID-T2 | 67/67 | SAM tests fail because of SRM problems after dCache upgrade -JRS |
T2_Estonia | 65/65 | Overloaded CE causing unstable cluster, SAM tests failing due to missing SE name while checking free space -JRS |
UNIGE-DPNC | 58/58 | 6d unscheduled Security maintenance. "Don't know yet how long they will be out of action" -JRS |
Russia | ||
BY-NCPHEP | 34/34 | Site appeared during the month, plus unscheduled SL5 + gLite 3.2 installation - JRS |
RU-Novosibirsk-BINP | 0/0 | Site appeared on 15 Oct, but hasn't passed any tests since -JRS |
RU-Phys-SPbSU | 40/40 | DomainName registration(delegation) problem -JRS |
Ru-Troitsk-INR-LCG2 | 62/98 | OK, downtime declared correctly for SL5 migration - JRS |
SouthEasternEurope | ||
BG02-IM | 42/42 | "Hardware tests and repair" -JRS |
IL-TAU-HEP | 74/74 | The site is new at the production level and had to deal with a few small problems plus a major electrical problem. Because of this, they have this week moved the site to a more reliable/safer place
RO-08-UVT | 69/69 | The site suffered from a problem with the BDII LDAP service during October (from 20.10 to 31.10)
RO-09-UTCN | 43/43 | Unscheduled DNS failure, root exploit fix, CRL update failures, yaim configuration stalled while configuring the CE |
RO-14-ITIM | 37/37 | Storage element down (19d, unscheduled) -JRS - The site suffered from numerous problems during the reporting period: 1. kernel update problem; 2. hardware problem in the storage element (the system had to be reinstalled); 3. network problems (hardware problem with the main router of the site's ISP) - the network problem persists and they are working on it
RO-15-NIPNE | 62/62 | |
TR-01-ULAKBIM | 22/22 | Unscheduled CE, SE and WN upgrade -JRS. After Update 57 for gLite 3.1, the top-level BDII and the site BDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update. The top-level BDII has been stable since a fresh installation and configuration of the bdii packages. In addition, the TR-01-ULAKBIM site was in maintenance for nearly 2 weeks: it was upgraded with new hardware for the CE and WNs and migrated to gLite 3.2 with SL5.3. Since 28/10/2009 the TR-01-ULAKBIM site has been running with 1200 worker nodes with an IB interconnect.
TR-03-METU | 62/62 | After Update 57 for gLite 3.1, the top-level BDII and the site BDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
TR-04-ERCIYES | 61/61 | Unscheduled upgrade of CE and SE -JRS. After Update 57 for gLite 3.1, the top-level BDII and the site BDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
TR-07-PAMUKKALE | 68/68 | After Update 57 for gLite 3.1, the top-level BDII and the site BDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
TR-09-ITU | 71/71 | Unscheduled downtime in CE, possible hardware problem in the server -JRS. After Update 57 for gLite 3.1, the top-level BDII and the site BDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
TR-10-ULAKBIM | 67/67 | After Update 57 for gLite 3.1, the top-level BDII and the site BDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
SouthWestEurope | ||
MA-01-CNRST | 53/53 | Unscheduled installation of a new SE, plus unscheduled site update to solve a local root vulnerability -JRS
UB-LCG2 | 60/60 | Serious but still unspecified network issues at the university hosting the Tier-2
UMinho-CP | 61/61 | |
UNICAN | 63/63 | |
UK/I | ||
UKI-LT2-QMUL | 66/66 | Various hardware and software issues; NFS server down; unscheduled Lustre Filesystem upgrade -JRS |
UKI-LT2-UCL-CENTRAL | 69/69 |
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
CN-BEIJING-PKU | 0/0 | Site has been suspended for consistent poor availability |
HK-HKU-CC-01 | 51/51 | |
JP-KEK-CRC-01 | 14/14 | |
MY-UM-CRYSTAL | 0/0 | |
PAKGRID-LCG2 | 73/73 | Hardware problem with PC running siteBDII |
TH-HAII | 74/74 | |
TW-NCUHEP | 3/3 | Infrastructure upgrade, but with no scheduled downtime |
TW-NTCU-HPC-01 | 73/73 | System upgrade, downtime not scheduled |
Taiwan-IPAS-LCG2 | 19/19 | Power maintenance and site upgrade, but no scheduled downtime. Site has been suspended for consistent poor availability |
VN-IFI-PPS | 52/52 | |
VN-IOIT-HN | 43/43 | |
CERN | ||
Umontreal-LCG2 | 73/73 | We had two problems this month; for the first, see ticket GGUS-51387
CentralEurope | ||
ELTE | 49/49 | Host certificates expired & problems getting new ones (GGUS-51953)
FMPhI-UNIBA | 69/69 | Security vulnerability; site closed during fix (GGUS-51952)
PEARL-AMU | 57/57 | Maintenance works on SE, SQLite DB problems and problems with two WNs (GGUS-51951)
egee.fesb.hr | 55/55 | Problems with SE & installation of new WNs (GGUS-51950)
GermanySwitz. | ||
SWITCH | 5/5 | Nodes reinstallation due to "unexpected behaviour" |
UNI-SIEGEN-HEP | 57/57 | We had only 57% availability due to two downtimes. The first was a power cut at our university in Siegen, and the second was a software update on our whole computer cluster this month. Some additional problems with kernel patches then caused the availability you have seen.
wuppertalprod | 70/70 | We had one "black hole" WN which had a wrong time. ATLAS jobs were not matched to that node (so it basically only ran ops jobs), so from the ATLAS point of view we were efficient and we realised the problem too late. After fixing this, the problem went away.
Italy | ||
CNR-ILC-PISA | 55/55 | Problem with the fetch-crl file; the related command wasn't referenced in the cron job itself (GGUS-51355)
CNR-PROD-PISA | 51/51 | several test jobs blocked the CE, causing failures with "proxy expired" (GGUS-51539)
ESA-ESRIN | 54/99 | Scheduled downtime from 2009-08-10 to 2009-09-14 for a gLite update on the CE, StoRM and LFC, plus migration of some gLite services to another server (the site remained in production because the initially declared downtime duration was less than a month)
INFN-FERRARA | 44/44 | Problem related to the SE: it runs on a virtual machine, and some settings have been changed (in particular a dedicated network port), in the hope that it will be more stable
INFN-NAPOLI | 57/57 | Unexpected blackout problems, also involving the UPS
INFN-NAPOLI-PAMELA | 68/69 | Unexpected blackout problems, also involving the UPS
INFN-PERUGIA | 45/92 | Scheduled downtime from 2009-07-02 to 2009-09-15 for a datacenter room change. Site status was "uncertified" during that period
NorthernEurope | ||
BEgrid-KULeuven | 51/51 | |
BEgrid-ULB-VUB | 74/74 | |
CSC | 57/57 | disk system failure |
EENet | 41/41 | CE in testing, unscheduled |
HPC2N | 64/64 | New lcg-CE was not in-place until the 11th of September |
HTC-BIGGRID | 63/63 | |
ITPA-LCG2 | 66/66 | We were installing/configuring spektras.itpa.lt CE |
KTU-BG-GLITE | 71/71 | |
Russia | ||
RRC-KI | 23/23 | Upgrade to RHEL 5 and gLite 3.2, scheduled downtime declared |
RU-SPbSU | 61/61 | |
ru-Chernogolovka-IPCP-LCG2 | 65/65 | |
ru-PNPI | 20/20 | |
SouthEasternEurope | ||
BG02-IM | 73/73 | |
GR-04-FORTH-ICS | 1/1 | StoRM configuration error, unscheduled downtime extended several times |
GR-09-UoA | 74/74 | computer centre move |
IL-BGU | 21/21 | |
TECHNION-HEP | 64/64 | Problem with the ntp daemon running at the site. It is fixed.
TR-04-ERCIYES | 20/20 | TR-04-ERCIYES had a Maradona error on job submission. The admins reconfigured the whole site, checked logs and followed the network traffic, but the SAM tests were not stable. The CE and SE were upgraded with new hardware and the site was reconfigured once more. Now the SAM tests are OK. We are waiting for the site to stabilise after the new upgrades.
SouthWestEurope | ||
CFP-IST | 58/58 | Hardware failure; the system admin is away
DI-UMinho | 70/70 | |
LIP-Coimbra | 31/98 | Scheduled cluster reconfiguration and upgrade |
UB-LCG2 | 54/54 | Unspecified network issues |
UK/I | ||
UKI-LT2-RHUL | 58/58 | |
UKI-LT2-UCL-CENTRAL | 45/60 | Scheduled upgrade of cluster filesystem |
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
CN-BEIJING-PKU | 14/36 | |
IN-DAE-VECC-EUINDIAGRID | 11/11 | |
JP-HIROSHIMA-WLCG | 68/84 | |
MY-MIMOS-GC-01 | 51/51 | |
TH-HAII | 58/58 | |
TW-NCUHEP | 54/54 | |
Taiwan-IPAS-LCG2 | 26/26 | |
VN-IOIT-KEYLAB | 47/48 | |
CentralEurope | ||
BY-UIIP | 69/69 | The site administrator was blocking ports >55000 for incoming connections, which caused failures of the CE-host-cert-valid test.
ELTE | 47/82 | Problem obtaining new host certificates for the site, caused by the holiday period.
HEPHY-UIBK | 68/95 | The site was forced by the university to go offline because of some upgrades. Additionally, the site had serious problems with its storage system.
PEARL-AMU | 36/55 | Low availability and reliability were caused by a problem with disk space.
France | ||
ESRF | 57/73 | |
IN2P3-LPSC | 54/85 | Two weeks of SD were planned with the approval of the French ROC due to foreseen electrical work in the machine room, which explains the poor monthly availability.
IPSL-IPGP-LCG2 | 51/77 | One week of SD was planned with the approval of the French ROC due to the lack of staff during the vacations. The SD had to be extended in order to find and apply the fixes linked to the security alerts.
GermanySwitz. | ||
GoeGrid | 68/70 | Unscheduled downtime due to an emergency power outage caused by a burning transformer. In addition we were facing problems with our site BDII; these occurred only from time to time, but they made the service unstable and lowered our overall availability.
SWITCH | 40/40 | The performance of the site in July and August was mostly due to the fact that installation of the June EUGridPMA bundle (and later versions) was consistently delayed at the partner sites: our CE certificate is in fact a new SWITCHpki grid server certificate (issued by QuoVadis) which was installed last June, and for quite some time the SAM tests were failing because the QuoVadis CA was not recognized by the various central grid nodes such as the WMS etc.
UNI-SIEGEN-HEP | 52/81 | We had only 52% availability due to unscheduled downtime. Our main administrator has left, so we need some time to solve problems at the moment. In August we had downtime due to the security incident related to some kernel versions.
Italy | ||
ESA-ESRIN | 29/93 | Scheduled downtime for a gLite update on the CE, StoRM and LFC, plus migration of some gLite services to another server.
INFN-BARI | 74/74 | In August, while the availability of the INFN-BARI site was slightly above the threshold, the reliability was 74%, which is in fact below the threshold, but only slightly. This low reliability was due to two reasons: 1) in the first days of August we were still recovering from the chiller fault that occurred on the 28th of July; 2) during the two central weeks of August the Physics Department in Bari was closed and the personnel were on vacation. In such conditions the farm monitoring and management could only be done remotely and with reduced manpower. This caused some delays in the discovery of faulty components and in their restarting/replacement.
INFN-CAGLIARI | 67/67 | |
INFN-FERRARA | 51/51 | The site has been experiencing intermittent problems with the SAM tests for quite some time. We ran extensive tests on the RAM (even though the machine uses ECC RAM) and ruled it out. What puzzles us is that sometimes a test flips from OK to not OK (or vice versa) even if we don't touch the system. We are checking the various logs and have already tried a few things, but have not found the solution yet. I also checked with a colleague working for the LHCb experiment, and their tests have not spotted a single glitch in the last month (i.e. we are fully certified to run their jobs). I did this cross-check just because I know one of the people there; I know it is not an extensive test, but the result still puzzles me.
INFN-LNS | 21/46 | From 8 to 23 August, scheduled downtime for summer holidays. From the 24th, general malfunction on site due to UPS problems. Only one person was available, which meant a lot of time to fix the open tickets.
INFN-NAPOLI | 13/99 | Hardware problems and some missing updates caused a lot of SAM failures
INFN-NAPOLI-ARGO | 51/53 | some missing updates (CAs) due to the holiday period |
INFN-NAPOLI-PAMELA | 34/95 | Farm unattended and in downtime for almost the whole month
NorthernEurope | ||
BEgrid-UGent | 65/65 | No reply from site despite several mails and a GGUS ticket. Site suspended 28/9/09 by NE ROC |
CSC | 67/83 | The egee-ce.csc.fi front end was down mainly due to unforeseen disk failures on the Murska cluster. Actually the front end itself was OK; only the WNs and the shared filesystem were affected
HPC2N | 1/1 | We are still trying to set up a new configuration for a new cluster. We got an SE (DPM) up on the 25th of August but we are still having problems with the CE. It is our hope that we can get it up before the 14th of September.
ITPA-LCG2 | 53/91 | For a long time we have been using an 'sdj' queue for the SAM tests, but for some reason the WMS routing the SAM jobs from CERN stopped sending them to queues named 'sdj'. I have indicated this in the corresponding GGUS ticket assigned to me. Although the CERN SAM tests were failing, our site was fully functioning, as could be seen on BalticGrid's local SAM tests website, sam.mif.vu.lt
PDC | 63/69 | The top BDII failed a couple of times, unfortunately over weekends
Russia | ||
BY-NCPHEP | 0/ n/a | |
Ru-Troitsk-INR-LCG2 | 71/72 | |
ru-PNPI | 61/61 | |
SouthEasternEurope | ||
HG-05-FORTH | 71/73 | The top BDII used by the site is mon01.ariagni.hellasgrid.gr, and the site failures were caused by problems with this top BDII. The site admins informed us that they have changed the top BDII used by the site to the bdii.core.hellasgrid.gr round-robin mechanism we maintain in the HellasGrid infrastructure.
RO-08-UVT | 47/93 | |
RO-13-ISS | 53/74 | 1. From 03.08.2009 to 13.08.2009 the main power cable was pierced - no electricity. The electrical network and power capacity have since been upgraded.
TECHNION-HEP | 73/73 | |
SouthWestEurope | ||
ESA-ESAC | 58/58 | |
LIP-Coimbra | 0/ n/a | |
UAM-LCG2 | 67/94 | |
UPV-GRyCAP | 17/19 | |
e-ca-iaa | 54/54 | |
UK/I | ||
UKI-LT2-RHUL | 0/0 | The site BDII was marked as being in downtime for the whole of this month; the site was otherwise operational.
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
Australia-ATLAS | 37/37 | Site was testing a PPS version of the BDII which was incompatible with the GStat tests. This resulted in the low site availability for July |
CN-BEIJING-PKU | 2/2 | |
IN-DAE-VECC-EUINDIAGRID | 1/1 | |
JP-HIROSHIMA-WLCG | 52/77 | |
NCP-LCG2 | 61/97 | |
TH-HAII | 0/0 | |
TW-NTCU-HPC-01 | 59/59 | |
Taiwan-IPAS-LCG2 | 10/10 | |
CERN | ||
UFRJ-IF | 57/59 | The problem is linked to the network issue the site experienced in June: the site's internal monitoring system gives 99.595% uptime while the site's non-EGEE regional Nagios says 37.55%. In July the site also discovered a glitch in the monitoring system, related to the fact that the non-EGEE regional Nagios uses hard-coded IP addresses and the site changed from one network to another. So the site would expect an availability of 68%, while the EGEE monitoring system measured 57%. The site believes the difference is due to a different methodology.
France | ||
IPSL-IPGP-LCG2 | 52/96 | This is a small site mainly supporting ESR with only one site administrator. In accordance with the main VO and the ROC, the site was put into downtime during the site administrator's vacations. |
GermanySwitz. | ||
LRZ-LMU | 67/67 | LRZ-LMU had a bad SRM failure, with a corrupt system disk and PNFS DB. This took up to 10 days to repair (from 5th July), hence the 67% availability in July.
SWITCH | 68/68 | The performance of the site in July and August was mostly due to the fact that installation of the June EUGridPMA bundle (and later versions) was consistently delayed at the partner sites: our CE certificate is in fact a new SWITCHpki grid server certificate (issued by QuoVadis) which was installed last June, and for quite some time the SAM tests were failing because the QuoVadis CA was not recognized by the various central grid nodes such as the WMS etc.
Italy | ||
CNR-ILC-PISA | 66/92 | Two unscheduled downtimes due to electrical problems, quickly announced in the GOC-DB (as attested by the reliability value). After the second electrical problem the ntp daemon didn't start, and we found out that an update performed in the previous days had changed the SELinux policies; by default ntpd was blocked
CYBERSAR-CAGLIARI | 16/37 | |
CYBERSAR-PORTOCONTE | 55/69 | Site entered production on Jul 27th; the replica test on the CE failed due to a power cut affecting the BDII (published by the CYBERSAR-CAGLIARI site) configured on the WNs
INFN-NAPOLI | 46/51 | |
INFN-NAPOLI-PAMELA | 57/63 | Several power supply problems caused unexpected power cuts
SPACI-LECCE | 39/50 | hardware problem on SE |
NorthernEurope | ||
ITPA-LCG2 | 5/28 | First we had suspicious activity which might have been a security incident, but we had no further proof of it. To be on the safe side we decided to reinstall our stack from scratch with the latest software. Later we had a hardware failure on the server hosting the virtual machines.
KTU-BG-GLITE | 62/62 | The GOC DB was unusable at times, so I was unable to properly register downtimes. Our site was functional, except for the accounting data not being published; in terms of availability it was providing services to users most of the time. The system administrator was also on vacation during this period.
KTU-ELEN-LCG2 | 73/73 | That was because of vacation periods. Increased network load (new user jobs, our SE being used to keep job results from different clusters), APEL instability and BDII timeouts may also have contributed to the negative metrics.
Russia | ||
RRC-KI | 67/80 | |
RU-SPbSU | 57/64 | |
ru-PNPI | 68/69 | |
SouthEasternEurope | ||
GR-07-UOI-HEPLAB | 51/92 | CE node was down due to a power supply failure, from 2009-07-17 12:00 UTC to 2009-08-02 14:07 UTC
RO-15-NIPNE | 71/74 | Cooling system failures
TR-05-BOUN | 59/75 | The TR-05-BOUN site was moved from the South Campus to the Kandilli Campus of Bogazici University; as a result it was in maintenance at the beginning of July
WEIZMANN-LCG2 | 69/69 | |
SouthWestEurope | ||
ESA-ESAC | 46/46 | |
IEETA | 67/77 | |
LIP-Coimbra | 53/97 | |
MA-01-CNRST | 43/43 | |
NCG-INGRID-PT | 26/92 | |
e-ca-iaa | 53/69 | |
UK/I | ||
UKI-LT2-IC-HEP | 69/81 | |
UKI-LT2-QMUL | 71/71 | |
UKI-LT2-UCL-CENTRAL | 5/33 |
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
Australia-ATLAS | 0/0 | Because of network latency, site was testing a version of the BDII which was incompatible with the GStat tests. This resulted in zero availability for June, even though the site was OK |
CN-BEIJING | 1/1 | |
HK-HKU-CC01 | 55/55 | |
IN-DAE-VECC-EUINDIAGRID | 36/36 | |
JP-KEK-CRC-01 | 0/0 | |
KR-KISTI-HEP | 23/27 | |
MY-UPM-BIRUNI-01 | 4/4 | |
PAKGRID-LCG2 | 49/78 | |
TH-HAII | 37/37 | |
Taiwan-IPAS-LCG2 | 0/ n/a | |
CERN | ||
ALBERTA-LCG2 | 55/55 | |
UFRJ-IF | 69/76 | The site is having network problems. The site Nagios shows good results. They are negotiating a better connection |
Uniandes | 52/59 | |
CentralEurope | ||
BY-UIIP | 34/52 | The main site admin was on holiday and the backup admin didn't answer. During that time a problem with missing libraries on the WNs appeared, as well as an SE misconfiguration. In summary: a procedure for delegating the site administrator's rights has not been developed and practised at UIIP. The fault lies with the main site administrator.
egee.irb.hr | 44/81 | Site has repeated problems with cooling system |
France | ||
GermanySwitz. | ||
UNI-FREIBURG | 66/66 | In the first two weeks of June we had severe cooling problems resulting from a malfunctioning climate control in our computing center. In addition we suffered power cuts that resulted in total failures of our hardware. In the meantime we have overcome these technical problems and are running stably.
Italy | ||
CNR-ILC-PISA | 46/82 | After the application of Updates 45 and 46, our SE (DPM) had a dependency problem with the glite-info-dynamic-dpm library. For some days the SE did not publish the GlueSAPath variable and the GlueSEUniqueID tree. After ten days of attempts with the help of IT CMT, we reinstalled the SE from scratch, and only after solving other problems with some x86_64 packages did the SE respond well to the SAM tests
CYBERSAR-CAGLIARI | 69/83 | |
INFN-CATANIA | 69/69 | The globus-job-manager-marshal service doesn't delete unused files under /opt/globus/tmp/gram_job_state/. Up to 20000 files, job submission is OK; above that number it starts to have problems. I delete a lot of files by hand every 2 days, but that is not a good solution for a site manager (a cleanup sketch appears after this table). The development team has released 3-4 versions of the globus-gma rpm but the problem remains. Our availability and reliability were under 70% only during the first week of June because, starting from the second week, I fixed it by hand.
INFN-GENOVA | 65/65 | |
INFN-MILANO | 71/71 | The main source of errors in the SAM tests in June is the storage element itself, which is a DPM SRM storage element. As a short-term solution we are going to upgrade the DPM server hardware in the next few days; as a mid-term solution we are going to replace the DPM-based SE with a StoRM-based SE
INFN-NAPOLI | 62/64 | Problems with the CE file system due to the /opt/exp_soft directory, finally moved to a new disk; the solution wasn't rapid due to limited manpower
INFN-PARMA | 47/72 | On June 17 we got an authentication problem on our storage element, StoRM v1.3. The problem arose only for users belonging to the sgmops group. The effort to solve the problem was unsuccessful, so we had to install the server from scratch with the newer version, StoRM 1.4.
INFN-ROMA1-CMS | 68/72 | In June 2009 INFN-ROMA1-CMS hoped to have reached a good level of stability, but this proved not true due to continuous flips of the CE. So we decided to reinstall the middleware on the CE. This process was long and painful: the lack of complete, up-to-date and understandable documentation, specifically on the variables in siteinfo.def, made us lose several days of availability despite help from our ROC. So, between this process and the previous CE glitches, we lost almost 10 days in June. Now the CE is fully reinstalled and we hope to have all components under control.
INFN-ROMA1-VIRGO | 53/70 | A failure happened on the CE services and the site manager could not intervene at once; hence the site remained ~10 days in unscheduled downtime.
SPACI-LECCE | 35/56 | Hardware problems on the storage element; the hardware was changed, as were the IP and hostname. Some network problems need further investigation.
UNI-PERUGIA | 39/55 | Configuration problems related to the SE machine (se.grid.unipg.it) corrupted the whole UNIPG-SITE (ticket start date 2009/6/16 10:58:32). The sgmops user Judit Novak was not able to authenticate on the machines belonging to UNIPG-SITE. The problem was solved (ticket closing date 2009/7/1 9:17:03) by adding more sgmops pool accounts. Due to the unpredictable nature of the problem we were not able to set a scheduled downtime.
NorthernEurope | ||
ITPA-LCG2 | 54/54 | We had a security incident, and now are upgrading/reinstalling the servers. The site is in unscheduled downtime now. |
PHILIPS-TGRID | 68/68 | We used to add all the VOs we support (on HTC-BIGGRID) to PHILIPS-TGRID as well. This meant that jobs (that did not have many requirements) could also be queued on PHILIPS-TGRID. Some VOs solve this by adding their own software tags to classify a site as 'OK' to run jobs, but not all VOs do this. In the end we removed all non-infrastructure VOs from PHILIPS-TGRID. Also, pbs/tmpdir was not ideally configured on our WNs, so it filled up very fast; we reconfigured the WNs. Finally, we did not give priority to solving problems on PHILIPS-TGRID (because it is a test site).
VGTU-gLite | 59/80 | VGTU-gLite was in maintenance status for a long time in June. After it was reinstalled with a new OS and started to receive the SAM tests, one of the WNs was failing randomly because of I/O errors on its HDD, and it was hard to detect such a problem as we didn't have a hard disk monitoring tool in place yet. That WN was removed from the site until the HDD is replaced with a new one. Now we monitor all WN hard disks using the 'smartmontools' tool and run various tests once per week to prevent such problems in the future.
Russia | ||
RRC-KI | 69/72 | 1. There was a problem with the air conditioners; as a result there was an unscheduled downtime involving the SE. 2. Because of the STEP'09 stress test of ATLAS, some tuning of the CE and SE was done. 3. The site's network connection with Europe was down for a few days.
IPCP-LCG2 | 57/57 | |
ru-PNPI | 65/66 | This site actively participated in the STEP'09 stress test, in particular the ATLAS part. Unfortunately, a bottleneck in the internal network led to a site crash during the test. The problem was investigated and the local network is being modified.
SouthEasternEurope | ||
GR-05-DEMOKRITOS | 35/84 | Several problems with power supplies from previous months. It seems that the power supply problems no longer exist. We bought a new server to act as a cluster controller and we are in the process of reconfiguring the site.
RO-07-NIPNE | 60/62 | We had a problem with our cooling system during this month, so the cluster had to be stopped very often
TR-05-BOUN | 63/68 | The DNS server of Bogazici University was changed and there were problems with the reverse DNS records of TR-05-BOUN. After that problem, TR-05-BOUN was in maintenance: the site was moved from the South Campus to the Kandilli Campus of Bogazici University, and at the same time its CE and SE were upgraded
SouthWestEurope | ||
UNICAN | 65/65 | |
UK/I | ||
UKI-LT2-UCL-CENTRAL | 13/18 | |
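
The INFN-CATANIA entry above describes manually deleting stale files under /opt/globus/tmp/gram_job_state/ every couple of days to keep job submission working. As an illustration only, here is a minimal sketch of that kind of periodic cleanup; the age threshold and the assumption that files untouched for a few days belong to finished jobs are not from the report and would have to be verified per site before deleting anything.

```python
#!/usr/bin/env python
"""Minimal sketch of the manual gram_job_state cleanup described above.

Assumption (not from the report): files not modified for MAX_AGE_DAYS under
GRAM_STATE_DIR belong to finished jobs and are safe to remove. Verify this
locally; keep DRY_RUN = True until you are sure.
"""
import os
import time

GRAM_STATE_DIR = "/opt/globus/tmp/gram_job_state"  # directory named in the report
MAX_AGE_DAYS = 2                                   # the admin cleaned up every 2 days
DRY_RUN = True                                     # only report, do not delete


def cleanup(root, max_age_days, dry_run=True):
    """Walk root and remove (or just report) files older than max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    stale = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    stale += 1
                    if dry_run:
                        print("would remove", path)
                    else:
                        os.remove(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return stale


if __name__ == "__main__":
    print(cleanup(GRAM_STATE_DIR, MAX_AGE_DAYS, DRY_RUN), "stale files found")
```

Run from cron every couple of days (with DRY_RUN disabled once the age threshold has been checked), this automates the by-hand cleanup the report describes, though the real fix remains a working globus-gma release.
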
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
Australia-ATLAS | 0/0 | Because of network latency, site was testing a version of the BDII which was incompatible with the GStat tests. This resulted in zero availability for May, even though the site was OK |
CN-BEIJING-PKU | 7/9 | |
HK-HKU-CC-01 | 54/54 | |
IN-DAE-VECC-01 | 20/21 | |
JP-KEK-CRC-01 | 43/43 | |
KR-KISTI-HEP | 61/69 | |
MY-UPM-BIRUNI-01 | 0/0 | |
NCP-LCG2 | 35/59 | |
TW-NCUHEP | 66/66 | |
Taiwan-IPAS-LCG2 | 0/N/A | |
CERN | ||
TORONTO-LCG2 | 55/55 | Seems to be getting back into shape and I have checked that they are fine for June; the May figure was in any case on an upward trend, and I know they had dCache issues.
UFRJ-IF | 53/61 | Seems to have had an issue with the site BDII. I have sent an e-mail, as you know, but they look much better in June
CentralEurope | ||
PEARL-AMU | 68/68 | |
egee.irb.hr | 18/38 | |
France | ||
GermanySwitz. | ||
GSI-LCG2 | 68/68 | The system is Debian Linux, which was not completely supported by the gLite middleware. No detailed news about improvements or comments for the month related to A/R monitoring
MPI-K | 62/78 | Due to the problematic tickets 47872, 47920 and 47952 opened last month, which were closed as SOLVED in May. There have been no problems with the site in the last 30 days according to SAM.
TUDresden-ZIH | 17/32 | May was the month we set up our site. Our site was set to production state though we still had problems in the initial setup. The problems took us some time to fix. |
UNI-BONN | 68/88 | We were in an unscheduled downtime from 2009-04-20 at 8:00 to 2009-05-17 at 8:16 to increase the storage capacity and install a new dCache. There were two extensions after the initially planned 4-day intervention, since we had some problems with the new dCache installation; this was finally solved. We also had SAM tests failing on a weekend later in the month due to some problems with the space token configurations needed for the ATLAS VO, which affected the storage globally.
Wuppertalprod | 16/17 | We upgraded our dCache instance and lost the complete PNFS database in doing so. Support was very limited during that time. The dCache.org people tried several times to save the database; in the end we had to give up and install from scratch, with all data lost
Italy | ||
CNR-ILC-PISA | 64/89 | |
CYBERSAR-CAGLIARI | 42/60 | problem with globus-gma processes |
INFN-BARI | 58/75 | |
INFN-FERRARA | 62/68 | Back in production since May 8th, after a reinstallation of the farm and a change of supporters; tests failed due to configuration problems
INFN-MILANO-ATLASC | 69/70 | New site, in production from May 18th: there were authentication failures relating to ops users, fixed after several days of hard debugging
INFN-NAPOLI-PAMELA | 66/80 | |
INFN-ROMA1-CMS | 65/65 | The site is now fully operational. However, in May we had two problems. One was related to trying to disable an unsupported VO, which resulted in a major misconfiguration of the CE; this took away 4 days, happening just before a long weekend. The other was related to a misconfiguration of the SE, which again took about 4 days to resolve
INFN-ROMA3 | 41/46 | The site had some problems with storage during the month of May. At the beginning of the month we added some disk to a couple of volumes and we had to restripe them for proper balancing (we are using GPFS): this had an impact on our Storm SE, which started timing-out, so we decided to declare an unscheduled downtime while the operation was ongoing. Around the 20th of the month, the addition of a new disk-server broke GPFS for a few hours. Throughout the month, there have been intermittent problems, mainly due to the SE. For the above reasons, we decided to reinstall our Storm SE, and this happened during the scheduled downtime around 8th June. Since then, the site is performing well. |
INFN-TRIESTE | 25/45 | Problems with globus-gma and with the storage element (StoRM)
SISSA-TRIESTE | 64/64 | Problems with globus-gma and 2 unscheduled power blackouts.
SNS-PISA | 60/64 | The availability problems of the SNS-PISA grid node have been related to large job submissions from some VOs in the past months, which overloaded the CE and BDII (running on the same hardware). The problem seems to have been solved by introducing a maximum queueable job limit via PBS three weeks ago.
SPACI-LECCE | 57/74 | Unexpected hardware failure on the storage element; it has been moved to another machine
NorthernEurope | ||
VGTU-gLite | 44/44 | At this moment VGTU-gLite is in downtime. I think we have had problems throughout the last half year because of the SLC3 OS (which has not been maintained for a long time, plus the middleware) which we run on the CE/SE. The system is now being reinstalled and we'll bring it up as soon as possible with a fresh install of the gLite middleware and a new version of SLC4.
Russia | ||
JINR-LCG2 | 39/94 | JINR is one of the main sites in Russia. It is doing some reconstruction of its network facilities without downtime. Unfortunately the reconstruction required more time than expected. I hope that JINR will stabilize in June.
Kharkov-kipt-lcg2 | 53/83 | The Kharkov site has some trouble with network connectivity to Europe. Because this site is in Ukraine, it has not been possible for us to help them at this point
SouthEasternEurope | ||
CY-01-KIMON | 63/63 | The reason our site had low availability was the same as in the previous month: the site BDII. As I wrote in my previous report, we had the lcg-CE, TORQUE server, TORQUE utils and site BDII installed on the same machine. The problem appeared at the end of April, when we started supporting a new VO (lhcb), and lasted until the first week of May, when we moved the site BDII to a new machine. Since then this problem has disappeared and the site BDII is running successfully. We will do our best for our site to meet the specified criteria.
GR-05-DEMOKRITOS | 7/99 | Hardware problems mainly with power supplies |
MK-01-UKIM_II | 63/65 | |
RO-13-ISS | 71/74 | RO-13-ISS had connectivity problems and UPS failures, which affected the site, as power-offs were frequent. It still has to replace the UPS batteries with new ones from the vendor, and there could be some problems in the following month too. But the site is registering this downtime in the GOCDB, so reliability should be higher.
RO-15-NIPNE | 67/69 | RO-15-NIPNE had a problem with a specific LHCb software installation using SLC5 and gcc 4.3, as presented here. It seems that what was presented there was not functional for them, so they were failing more tests, but they have now reverted to SLC 4.7 / gcc 3.4 and the site is functional.
SouthWestEurope | ||
BIFI | 67/85 | |
ESA-ESAC | 68/68 | |
UB-LCG2 | 32/35 | |
UPV-GRyCAP | 48/49 | |
e-ca-iaa | 62/62 | |
UK/I | ||
UKI-LT2-UCL-CENTRAL | 23/25 | 1. Lustre file system slowdown (should now be fixed). 2. CE "funnies" led to CRLs not downloading reliably, plus peculiar behaviour at the shell prompt; a reboot fixes it for a while. 3. Proxy timeouts caused by a very full cluster; we have restricted ops jobs to 15 mins so they backfill, and boosted their priority. 4. The CE appears to be overstretched, which causes two problems: a) the OOM killer kicking in and killing things, b) downloading of the payload from the CE is very slow, causing SAM tests to time out (usually during the CAVER test). The main underlying problem is an overloaded CE, which is about to be upgraded (fixes 2 and 4).
Region | A%/R% | Reason |
---|---|---|
AsiaPacific | ||
CERN | ||
CentralEurope | ||
France | ||
GermanySwitz. | ||
Italy | ||
NorthernEurope | ||
LatinAmerica | ||
Russia | ||
SouthEasternEurope | ||
SouthWestEurope | ||
UK/I | ||