Comments on Monthly Site Availability/Reliability Reports

Introduction

The EGEE League Table is produced every month and gives the availability and reliability of all certified production sites. The current version of the ROC-Site Service Level Agreement stipulates that the minimum tolerated availability is 70%, and that reliability should be at least 75%. After each report is published, ROCs are asked to follow up with underperforming sites (i.e. all those not shown in green) in order to provide an explanation to SA1 management.

This Wiki page is a repository for the monthly "excuses" of poor site performance. Note that sites which have an availability of less than 50% for three consecutive months will be removed from the Production infrastructure.

Site administrators and ROCs should ensure that they have read the top-tips and explanations page. Another useful reference is the document that details how GridView performs the availability calculations.
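For reference, the arithmetic behind the two figures is straightforward: availability counts all time, while reliability excuses scheduled downtime, which is why a site in a long declared downtime can show a pairing such as 62/99. The sketch below (Python, with hypothetical figures) illustrates the idea only; the authoritative algorithm, including test weighting and the treatment of "unknown" periods, is in the GridView document mentioned above.

    # Illustrative sketch only (hypothetical figures); see the GridView
    # document for the authoritative calculation.
    #   Availability = uptime / total_time
    #   Reliability  = uptime / (total_time - scheduled_downtime)

    def availability_reliability(total_hours, up_hours, sched_down_hours):
        """Return (availability %, reliability %) for one site and month."""
        availability = up_hours / total_hours
        # Scheduled downtime is excused in the reliability figure, so
        # properly declared maintenance hurts availability but not
        # reliability.
        reliability = up_hours / (total_hours - sched_down_hours)
        return round(100 * availability), round(100 * reliability)

    # A 30-day month (720 h) with 10 days (240 h) of properly scheduled
    # downtime, the site passing tests the rest of the time:
    print(availability_reliability(720, 480, 240))  # -> (67, 100)

This also explains the recurring reviewer complaint about undeclared downtimes: an outage that is never registered in the GOC DB drags down both figures, whereas a correctly scheduled one is discounted from the reliability figure.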

2010

April 2010

The April League Table is here.


Region A%/R% Reason
AsiaPacific
ID-ITB 63/63 No Downtime declared - DCP
JP-HIROSHIMA-WLCG 65/65 No Downtime declared - DCP
MY-UM-CRYSTAL 71/71 random failures across the month - DCP
MY-UPM-BIRUNI-01 24/24 No Downtime declared - DCP
MY-UTM-GRID 0/0 No Downtime declared - DCP
TW-NCUHEP 61/61 No Downtime declared - DCP
VN-HPCC-HUT-HN 2/2 Unscheduled downtime of 7 days to check network connection - DCP
VN-IFI-PPS 38/38 Unscheduled downtime of 6 days to lay underground internet lines at Vinaren - DCP
CentralEurope
BY-JIPNR-SOSNY 69/69 Network problems - DCP
FMPhI-UNIBA 73/73 unscheduled downtime during 9 days due to site configuration problems - DCP
GermanySwitzerland
GSI-LCG2 73/73 Unscheduled downtime of 8 days to move nodes and reconfigure services - DCP
GoeGrid 74/74 Unscheduled downtime of 1 hour due to power outage, name server problems
TUDresden-ZIH 68/68 Unscheduled downtime of 1 day and 18 hours due to unknown problem with dCache
UNI-FREIBURG 73/73 random failures across the month - DCP
UNI-SIEGEN-HEP 73/73 No Downtime declared - DCP
wuppertalprod 60/60 site failures during 12 days without downtime declared - DCP
Italy
INFN-BARI 52/52 The problems observed in the last two weeks were mainly due to 2 different causes: 1) A couple of arrays were lost due to multiple concurrent disk failures; this led to problems with the SAM SRM-related tests. 2) A problem on the local batch system server; we had to switch on a new machine in order to solve this problem. We assume that both problems are now solved, so we expect far better behaviour.
INFN-CAGLIARI 61/71 There was a host without lcg-CA updates and without lcg-voms.cern.ch, on which almost all the test jobs happened to run
INFN-GENOVA 71/71 Fibre Channel disk problem and StoRM configuration problem
INFN-NAPOLI-CMS 29/29 unforeseen electrical interruptions
INFN-ROMA3 29/56 Site was in downtime from 6th april through 21st april (two consecutive time slots). We received and installed two new racks, and a new air-conditioning system for such racks. Installation was not as smooth as expected and there were some delays. After the long downtime we had some issues with our Storm SE, in conjunction with FTS: we were flooded by Atlas jobs, which repeatedly failed and triggered new file copies until a couple of spacetokens filled up. At that point, we concentrated on solving this problem, which implied switching services on and off several times.
INFN-TRIESTE 62/99 WN migration to SL5, CE update, new StoRM version installation and pool account reconfiguration
NGI_PL
PEARL-AMU 0/0 CE problem - DCP
NorthernEurope
EENet 66/66 intermittent failures across the month - DCP
KTU-ELEN-LCG2 54/54 No Downtime declared - DCP
LSG-RUG 53/53 No Downtime declared - DCP
T2_Estonia 47/47  
UNIGE-DPNC 69/69 site failing for 10 days without downtime defined - DCP
ROC_LA
SAMPA 0/0 Site not tested during most of the month - DCP
Russia
RU-Phys-SPbSU 2/2 No Downtime declared - DCP
ru-IMPB-LCG2 63/63 site failing for 11 days without downtime defined - DCP
SouthEasternEurope
BG02-IM 70/70 intermittent failures across the month - DCP
BG07-EDU 0/0 Site not tested during most of the month - DCP
GE-01-GRENA N.A./N.A. Site not tested during most of the month - DCP
IL-TAU-HEP 64/64 intermittent failures across the month - DCP
MK-02-ETF 0/0 Site not tested during most of the month - DCP
RO-02-NIPNE 0/0 Unscheduled downtimes during the whole month. Cooling, storage & config problems reported - DCP
TR-01-ULAKBIM 53/53 intermittent failures across the month. No downtimes - DCP
TR-03-METU 48/48 intermittent failures across the month. No downtimes - DCP
TR-04-ERCIYES 69/69 intermittent failures across the month. No downtimes - DCP
TR-05-BOUN 63/63 intermittent failures across the month. No downtimes - DCP
TR-07-PAMUKKALE 69/69 intermittent failures across the month. No downtimes - DCP
TR-09-ITU 68/68 intermittent failures across the month. No downtimes - DCP
TR-10-ULAKBIM 66/66 intermittent failures across the month. No downtimes - DCP
WEIZMANN-LCG2 70/70 intermittent failures across the month due to SE & BDII problems - DCP
SouthWesternEurope
BIFI 55/55 intermittent failures across the month due to top-BDII & WMS problems - DCP
DI-UMinho 8/8 site failing most of the month without scheduled downtime - DCP
IEETA 37/44 intermittent failures across the month - DCP
MA-01-CNRST 74/74 unsched maintenance for 2 days and 8 more days with failures - DCP
RedIRIS_GILDA 37/37 No Downtimes declared - DCP
UMinho-CP 65/65 intermittent failures across the month. Unsched electric intervention during 1 day - DCP
UNICAN 54/54 No Downtimes declared - DCP
UPV-GRyCAP 74/74 No Downtimes declared - DCP
UKI
UKI-LT2-RHUL 0/N.A. Scheduled Downtime during the whole month to upgrade cluster to SL5 - DCP

March 2010

The March League Table is here.


Region A%/R% Reason
AsiaPacific
CN-BEIJING-PKU 0/0 No downtime declared - DCP
ID-ITB 68/68 No downtime declared - DCP
JP-HIROSHIMA-WLCG 50/53 Hardware error in network switch - DCP
MY-UM-CRYSTAL 45/45 No downtime declared - DCP
MY-UPM-BIRUNI-01 66/66 Network & Core switch upgrades - DCP
MY-UTM-GRID 0/0 No downtime declared - DCP
NCP-LCG2 66/73 Network problem at ISP end. Config problems at DPM & WN. Testing services on new network. Power issues - DCP
TH-HAII 74/74 No downtime declared - DCP
TW-NCUHEP 54/54 No downtime declared - DCP
TW-NIU-EECS-01 40/40 No downtime declared - DCP
VN-IFI-PPS 56/56 No downtime declared - DCP
VN-IOIT-HN 1/1 No downtime declared - DCP
CentralEurope
FMPhI-UNIBA 14/26 Switch to gigabit link and reorganization of services - DCP
egee.grid.niif.hu 48/48 "full environmental change"? EMI transition preparation - DCP
GermanySwitzerland
GoeGrid 16/17 continued upgrades, filesystem problems, SRM problems and a defective switch
UNI-FREIBURG 20/20 NFS server problems. Software area unavailable - DCP
UNI-SIEGEN-HEP 45/45 Nagios test org.sam.CE-JobSubmit-/ops/Role=lcgadmin failed for a while (ticket 4461) but magically cured itself; reason unknown.
wuppertalprod 52/52 central file system failed. New network backbone will be installed - DCP
Italy
GRISU-CYBERSAR-CAGLIARI 57/57 Problem on local cluster and storage
GRISU-CYBERSAR-PORTOCONTE 45/45 problem on the SE due to an expired lcg-voms certificate; problem with a BDII set by mistake on the WNs
INFN-CAGLIARI 63/63 Maintenance on the CE and some problems on a few worker nodes, which we are investigating
INFN-GENOVA 66/93 Change CC cooling system and rearrange racks. Operations on UPS
INFN-PADOVA-CMS 64/64 problem related to a bug in the new StoRM version, job submission problems (reinstallation of the CE) and authentication problems due to too few sgmops pool accounts
INFN-PERUGIA 14/24 Hardware upgrade & issues on SE
TRIGRID-INFN-CATANIA 62/70 UPS maintenance & Hardware problems - DCP
NGI_PL
PEARL-AMU 29/70 "maintenance"? - DCP
PSNC-GILDA 30/69 "Initial site DOWNTIME"? - DCP
NorthernEurope
KTU-BG-GLITE 41/41 No downtime declared - DCP
ROC_Canada
CA-ALBERTA-WESTGRID-T2 59/83 Upgrade of power services to the main campus CC. WNs migration from SL4 to SL5 - DCP
ROC_LA
ICN-UNAM 41/58 Site down due to building restructuring next to the CC - DCP
SAMPA 0/0 Power outage due to strong rains. Troubleshooting CEs - DCP
Russia
RU-SPbSU 73/73 Network broken - DCP
ru-Chernogolovka-IPCP-LCG2 71/71 Hardware upgrade - DCP
SouthEasternEurope
IL-TAU-HEP 39/50 reformatting & installing new storage - DCP
RO-02-NIPNE 73/73 cooling problems - DCP
RO-03-UPB 26/26 power failure, configuration issues - DCP
RO-15-NIPNE 66/66 No downtime declared - DCP
TR-10-ULAKBIM 48/48 No downtime declared - DCP
SouthWesternEurope
BIFI 27/32 Moving to a new building - DCP
IEETA 46/46 No downtime declared - DCP
MA-01-CNRST 0/0 Unscheduled maintenance during the whole month - DCP
UB-LCG2 32/49 Delay in RAID controller delivery. Power cut. GIP stopped publishing info after adding queues to Torque. - DCP
UNICAN 64/64 No downtime declared - DCP
UPV-GRyCAP 63/63 No downtime declared - DCP
UKI
UKI-SOUTHGRID-BRIS-HEP 65/65 No downtime declared - DCP
cpDIASie 66/66 No downtime declared - DCP

February 2010

The February League Table is here.


Region A%/R% Reason
AsiaPacific
CN-BEIJING-PKU 32/32 Down most of the month with no downtime declared - DCP
HK-HKU-CC-01 63/70 Down for 5 days with no downtime declared - DCP
INDIACMS-TIFR 13/13 Unscheduled downtimes due to upgrades to glite-WN 3.2, AC & electrical maintenances & Squid server HDD error - DCP
MY-UM-CRYSTAL 37/37 Down 12 days with no downtimes declared - DCP
PH-ATENEO 37/37 Down most of the month with no downtimes declared - DCP
TH-HAII 68/68 Down 5 days with no downtimes declared - DCP
VN-HPCC-HUT-HN 50/50 Down half of the month with no downtimes declared - DCP
VN-IFI-PPS 8/8 Down most of the month with no downtimes declared - DCP
VN-IOIT-HN 49/49 Down half of the month with no downtimes declared - DCP
CentralEurope
CYFRONET-LCG2 69/80 Unscheduled downtime of 5 days due to power maintenance, DPM reinstall & shared areas migration - DCP
IFJ-PAN-BG 60/72 Power maintenance - DCP
PSNC-GILDA 0/0 Site tested during just 2 days in the month, with failures - DCP
prague_cesnet_lcg2 72/72 Intermittent failures during the month - DCP
France
MSFG-OPEN 62/62 Site tested during just 8 days in the month, out of which 2 had failures - DCP. A new site with initial difficulties in the production environment - RR
GermanySwitzerland
GSI-LCG2 68/68 Unscheduled downtime of 9 days due to power failure in UPS & cooling maintenance - DCP
GoeGrid 56/63 Unscheduled downtime of 6 days for hardware upgrades and dCache Upgrade
Italy
INFN-BOLOGNA 54/73 Unscheduled downtime of 2 days & scheduled downtimes during 6 days (added new WNs)
INFN-NAPOLI-CMS 57/70 Software upgrades & power cuts
NorthernEurope
EENet 70/70 intermittent problems during the month - DCP
IMCSUL 0/0 Site tested during just a few hours in the month, with failures - DCP
T2_Estonia 70/70 intermittent problems during the month - DCP
VU-MIF-LCG2 54/54 down 9 days, not tested 9 days and up 10 days - DCP
ROC_Canada
ALBERTA-LCG2 0/0 Site upgrade and migration to Tier3 continued during 20 days of sched downtime - DCP
CA-SCINET-T2 67/67 Unscheduled downtimes to reconfigure hostnames and grid-user home space on CE - DCP
ROC_LA
SAMPA 14/22 power failure due to heavy rains in Sao Paulo & cluster did not restart afterwards. Troubleshooting CE, CREAM-CE & SE - DCP
Russia
ITEP 40/40 Unscheduled downtimes due to disk and network problems plus middleware upgrade - DCP
ru-Chernogolovka-IPCP-LCG2 22/22 No downtimes declared. Down during 22 days - DCP
ru-Moscow-MEPHI-LCG2 0/0 Unscheduled downtime during 89 days due to "General infrastructure problem" - DCP
SouthEasternEurope
GR-04-FORTH-ICS 69/69 Unscheduled downtime during 6 days to upgrade to SL5.3 and gLite 3.2 WNs - DCP
HG-06-EKT 59/88 Upgrade of worker nodes to SL5/gLite-3.2 & migration from GPFS to Lustre - DCP
IL-TAU-HEP 57/57 Problem on Lustre - DCP
SouthWesternEurope
BIFI 68/78 Scheduled downtime during 8 days to move to a new building - DCP
DI-UMinho 14/14 Unscheduled downtimes during 11 days due to updates in CE certificate - DCP
MA-01-CNRST 11/11 Unscheduled downtime of 21 days due to maintenance of the site - DCP
UB-LCG2 0/0 The RAID controller for the users' home directories died and had to be replaced - DCP
UKI
UKI-LT2-UCL-CENTRAL 39/65 Scheduled downtime during 10 days due to 'lfsck'-ing Lustre - DCP
UKI-SOUTHGRID-BRIS-HEP 0/0 Scheduled downtime during the whole month due to DPM retirement & lcgce04's WN configuration - DCP
UKI-SOUTHGRID-RALPP 74/74 Unsched downtime due to problems with air conditioning - DCP

January 2010

The January League Table is here.


Region A%/R% Reason
AsiaPacific
IN-DAE-VECC-01 32/32 Annual Maintenance of Cooling solution (unsched) - DCP
IN-DAE-VECC-02 69/69 Annual Maintenance of Cooling solution (unsched) - DCP
INDIACMS-TIFR 7/7 network outage (unsched), system disk fail on DPM disk server (unsched) - DCP
MY-MIMOS-GC-01 64/64 hardware failure (unsched) - DCP
MY-UM-CRYSTAL 67/67
NCP-LCG2 48/48 CE issues for 14 days (unsched) - DCP
PAKGRID-LCG2 17/17 internet link down during 70 days (unscheduled, not declared) - DCP
PH-ATENEO 31/31
TH-HAII 57/57
TW-NCUHEP 69/69
VN-HPCC-HUT-HN 67/67
CentralEurope
PEARL-AMU 74/74
GermanySwitzerland
SCAI 57/57 timeout requests to DPM for 6 days (unsched) - DCP
Italy
INFN-CATANIA 72/72 unexpected problem with the giga-pop connection to GARR
INFN-FERRARA 72/72 power supply problem for some days
NorthernEurope
BEgrid-KULeuven 46/46
VU-MIF-LCG2 2/2 Reinstall WN from SLES to SLC. Cooling problems. Whole month (unsched) - DCP
ROC_Canada
ALBERTA-LCG2 14/29 Site upgrade and migration to T3 during 8 days (unscheduled, not declared) - DCP
CA-SCINET-T2 61/68 Sched downtime to move & reconfig dCache disks & softw updates - DCP
SDU-LCG2 70/70
ROC_IGALC
UFRJ-IF 61/61 Problem with DPM & powercut (unsched) - DCP
ROC_LA
UNIANDES 66/73 Power failure, all personnel on vacation (unsched for 8 days) - DCP
Russia
ru-IMPB-LCG2 53/53
ru-Moscow-MEPHI-LCG2 0/0 Severe infrastructure problem (unsched for 25 days) - DCP
SouthEasternEurope
IL-TAU-HEP 55/55
RO-11-NIPNE 53/53 Internal network & hardware problems during 11 days (unsched) - DCP
SouthWesternEurope
MA-01-CNRST 59/59 Maintenance of the site during 21 days (unscheduled, not declared) - DCP
UKI
UKI-LT2-UCL-CENTRAL 72/72 The batch scheduler (moab) runs from lustre and is often slow to dispatch the ops jobs and job output retrieval is also problematic (also lustre related). The lustre filesystem has since had a thorough check.
UKI-SOUTHGRID-OX-HEP 58/75 Air cond. failure, 6 days (unsched), 7 days (sched) - DCP
UKI-SOUTHGRID-RALPP 64/94 Outage building power supply & reconfig dCache - 10 days (sched) - DCP

2009

December 2009

The December League Table is here.


Region A%/R% Reason
AsiaPacific
ID-ITB 64/64
IN-DAE-VECC-02 63/63 Site update and Cream-CE upgrade for which no scheduled downtime was declared - DCP
INDIACMS-TIFR 8/8 Networking issues during the whole month for which no scheduled downtimes were declared - DCP
MY-MIMOS-GC-01 52/52 Power failures during 13 days for which no scheduled downtimes were declared - DCP
NCP-LCG2 48/48 Deploying, testing & troubleshooting new WNs and CE to SL5 during 2 weeks without scheduled downtime - DCP
PAKGRID-LCG2 45/45 Hardware problems during 21 days without scheduled downtime - DCP
PH-ASTI-BUHAWI 46/46
PH-ATENEO 47/47
VN-HPCC-HUT-HN 67/67
VN-IFI-PPS 51/51
VN-IOIT-HN 63/63
CentralEurope
BY-BNTU 0/0 Site created in mid-December but received no tests - DCP
France
IPSL-IPGP-LCG2 59/97 Scheduled downtime of 12 days for CE winter break followed by some reconfig - DCP
M3PEC N.A./N.A. Existing site, but no tests received during the whole month - DCP
GermanySwitzerland
TUDresden-ZIH 61/77 Network and batch problems after software update on the machine during 7 days with unscheduled downtimes - DCP
UNI-SIEGEN-HEP 59/59 Long debugging of a SAM test failure (CE-sft-job) which didn't indicate any apparent reason and appeared erratically. Finally fixed. (GGUS ticket 53960) - WW
NorthernEurope
VGTU-TEST-gLite N.A./N.A. Existing site, but no tests received during the whole month - DCP
ROC_Canada
CA-SCINET-T2 52/54 Main core network switch power supply died and no tests during first half of the month. New site or site name changed? - DCP
TORONTO-LCG2 2/2 Unscheduled downtime during the whole month. Migration to new Tier-2 site - DCP
ROC_IGALC
CEFET-RJ 69/69 Unforeseen power cut during 2 days with unscheduled downtime declared - DCP
UFRJ-IF 58/58 Globus job manager problem, unexpected power cut, unscheduled maintenances, DNS problems on campus - DCP
ROC_LA
SAMPA 41/52 Air conditioning off, so cluster down from Dec. 28th - DCP
UNIANDES 44/60 Sched site maintenance of 5 days plus unsched power failure since Dec. 28th - DCP
Russia
ru-IMPB-LCG2 63/63 Several unscheduled maintenances declared - DCP
ru-Moscow-MEPHI-LCG2 41/41 Severe optical backbone damage outside the site (unsched 7 days) plus severe infrastructure problem (unsched, 11 days) - DCP
ru-PNPI 54/54 Unsched downtime of 7 days for a migration to SL5 - DCP
SouthEasternEurope
RO-09-UTCN 58/58 Lots of random CE-sft-lcg-rm errors due to timeouts on our central BDII server.
RO-11-NIPNE 59/59 Unsched downtime of 10 days due to Internal grid networking problems, updates for SL5 - DCP
TR-03-METU 68/68 The site DPM server (eymir.grid.metu.edu.tr) was under maximum load in December; it used all its memory and some swap space, and the swap usage caused connection timeouts in the DPM services. Its memory has been extended, and after the memory upgrade the site services began to work stably
SouthWesternEurope
UNICAN 73/73
UKI
UKI-LT2-UCL-CENTRAL 32/69 Sched downtime for 17 days for awaiting, testing & deploying new kernel from vendor plus unsched downtime of 4 days to reboot nodes and push out kernel across the cluster - DCP
UKI-SOUTHGRID-OX-HEP 72/72 Unsched downtime for 5 days due to problem at batch system of shared cluster plus catastrophic air cond. failure - DCP

November 2009

The November League Table is here.


Region A%/R% Reason
AsiaPacific
CN-BEIJING-PKU 50/50
INDIACMS-TIFR 45/46 Network problems between TIFR and all T1 other than CERN. Investigations by DANTE/GEANT on-going. Site should have declared 24hr unscheduled, and scheduled the rest of the downtime (but didn't) - JRS
JP-HIROSHIMA-WLCG 66/66 Unscheduled 14-day system migration to SLC5 -JRS
MY-UM-CRYSTAL 37/37
PAKGRID-LCG2 73/73 Hardware problems, but no judicious use of downtime declarations -JRS
PH-ASTI-BUHAWI 25/25
TW-NCUHEP 72/72
TW-NIU-EECS-01 68/68
VN-IFI-PPS 70/70
VN-IOIT-KEYLAB 63/63
CERN
CEFET-RJ 70/70 City's power system went offline, plus m/w and security issue -JRS
SDU-LCG2 74/74
TORONTO-LCG2 74/74 14-day downtime declared which could have been "scheduled" but wasn't -JRS
UFRJ-IF 70/70 Severe power failures -JRS
CentralEurope
BY-UIIP 56/56 Unscheduled(!) move to SL5 and gLite 3.2 with new hardware, followed by BDII and WMS problems (GGUS-54007)
PEARL-AMU 56/56 updated vulnerable kernel on all nodes, NFS-lock and DNS problems -JRS
egee.grid.niif.hu 68/68 "The site interfacing operational problem", err what does that mean? -JRS
France
IN2P3-IRES 72/72 Important upgrade following a security vulnerability impacting some services, coupled with a temporary lack of site-admin manpower that slowed down overall reactivity
GermanySwitz.
BMRZ-FRANKFURT 0/0 OK, new site which appeared at end of the month
GSI-LCG2 26/26 Unscheduled(!) 10-day reconfiguration of some services -JRS
SWITCH 68/68 We had a frequent and severe time-skew problem on our VMware machines. We recently set up a cron job that synchronises the time every 10 seconds, and since then we are consistently passing the SAM tests. Furthermore, we reinstalled our WNs several times, as we are now participating in the ARGUS deployment pilot (coordinated by CERN)
TUDresden-ZIH 43/46 Problems after upgrade of dCache -JRS
Italy
GRISU-COMETA-ING-MESSINA 54/72 departmental internet link failure (unscheduled downtime from 5th to 6th Nov + scheduled downtime from 7th to 14th Nov), recurrent (twice daily, on average) SRMv2 failures caused by Gridmap bug, as per GGUS 52668 (affecting the site since the last week of October, as soon as site got certified and in production, until the end of November, when the bug affecting Gridmap got fixed), sporadic CE sft-job failures, yet long lasting (usually tested again quite a few hours later), caused by globus-gma overload, now fixed (affecting last week of November). [low figures were partly due to a problem with SAM framework resulting in site disappearing from tBDII. Not site's fault - JRS]
GRISU-CYBERSAR-CAGLIARI 62/62 Network configuration problems not covered by a declared downtime - JRS
INFN-CATANIA 74/74
INFN-FRASCATI 56/76 site maintenance and storage system failure (unscheduled+scheduled downtime from 16th Nov to 27th Nov) - In November we scheduled a long downtime (4 days) in order to perform many operations (among them moving the farm to a new VLAN, installing some new FC switches, and moving the WNs from SL4 to SL5), but we met many hardware and software problems and had to extend the downtime. When we eventually finished the work (after 4.5 days), we rebooted all the machines and one HP blade chassis had a hardware failure. This happened on a Friday afternoon, after 5 PM, so HP support was already unavailable and we couldn't call them before Monday morning (we then put in an unscheduled downtime of 4 days). Moreover, this failure was difficult to detect even for the HP technicians, so we spent 2 days investigating and exchanging log files with them. The conclusion was that the failure was on the blade chassis rather than on any blade server, so we had to wait 2 additional days for spare parts. Two of these blade servers were the DPM storage element and one of the DPM disk pools, so it was not possible to remove the downtime before the failure was solved. We have just bought a new storage element (a single-unit server, not part of a blade), so we do not expect this problem again.
INFN-NAPOLI 50/50 site maintenance and software upgrade
INFN-NAPOLI-ARGO 70/70 Problems in storage server
INFN-NAPOLI-PAMELA 34/57 hardware maintenance (unscheduled+scheduled downtime from 5th to 24th Nov)
SNS-PISA 61/61 The reason for the low availability of the SNS site this month is some new configuration and a hardware update. I have just implemented - and probably just stabilised - a configuration with a strict "node - queue/VO" mapping: users belonging to a specific VO are able to use only a set of reserved nodes, without any kind of fair share over them. This is required for a particular, and probably temporary, requirement of some internal users. The testbed should be OK at the moment, so I hope the site availability will increase in the next few days.
LatinAmerica
SAMPA 3/15 Scheduled migration from gLite 3.1 to 3.2, plus major power problems in Brazil -JRS
Russia
RU-Novosibirsk-BINP 5/7
RU-Phys-SPbSU 0/0 DomainName registration(delegation) problem -JRS
ru-IMPB-LCG2 62/82 Scheduled maintenance -JRS
SouthEasternEurope
BG-INRNE 74/74
BG02-IM 67/67 Unscheduled 21 days of hardware tests and repairs -JRS
IL-TAU-HEP 43/45 Storage problems following building move -JRS
MK-01-UKIM_II 69/69 Various file-system problems plus node migrations which were not scheduled -JRS
RO-15-NIPNE 73/73
TECHNION-HEP 74/74 Storage system problems -JRS
TR-03-METU 50/55 A scheduled downtime for migrating all WNs to SL5.3 and gLite 3.2.
TR-09-ITU 58/58 The site had two different problems in November: a hardware problem detected on the computing element, which has been fixed, and an air-conditioner problem in the system room at ITU.
TR-10-ULAKBIM 74/74 Unscheduled downtime because of changes to the electrical infrastructure of the system room
SouthWestEurope
BIFI 32/32 Network outage at the university, misconfigured authentication in CE and WMS -JRS
LIP-Coimbra 66/73 Scheduled building power cut, plus problems with Lustre 1.8.1.1 upgrade -JRS
UK/I
UKI-LT2-UCL-CENTRAL 53/63 Proper use of downtimes! Scheduled m/w upgrade, plus emergency work on air conditioning ducts -JRS
UKI-SOUTHGRID-BHAM-HEP 67/67

October 2009

The October League Table is here.


Region A%/R% Reason
AsiaPacific
INDIACMS-TIFR 60/60 network problems causing timeouts of transfers and SAM tests - JRS
JP-KEK-CRC-01 57/57 13 day unscheduled downtime for "Removal work for obsolete tags glite 3.0" - JRS
MY-UM-CRYSTAL 36/36 Poor quality of external networking at campus. Tracking intermittent outage issue with site admin
MY-UPM-BIRUNI-01 66/67 9h scheduled downtime for "Power maintenance", but rest is unexplained -JRS
PH-ASTI-LIKNAYAN 49/49 OK, site appeared during the month (18 Oct) - JRS
TH-NECTEC-LSR 69/69 hostcert of CE expired early in Oct during a holiday break
TW-NCUHEP 18/18 4 days of unscheduled downtime for core grid services and computing farm upgrade. Worked with the site admin; the majority of site services had been recovered by the end of Oct.
TW-NTCU-HPC-01 71/71 SE host crashed (1 day outage), plus 1d17h unscheduled power maintenance - JRS
VN-IOIT-KEYLAB 46/46 DNS failures; the site admins have limited effort available to recover from the problem and are still waiting for confirmation from the local NOC people
CERN
CEFET-RJ 57/57 OK, site appeared during the month (20 Oct) - JRS
SDU-LCG2 42/42 Change vulnerable WNs, APEL & CE-sft-job failures, problems with the service CE-sft-lcg-rm -JRS
TORONTO-LCG2 44/44 One disk server had been lost and was not found for two weeks
CentralEurope
BY-UIIP 19/19 Lots of unscheduled downtimes: Problems with accounting and SE, network problems, h/w upgrade, moving to SL5 and glite3.2 (unscheduled!) -JRS
FMPhI-UNIBA 40/40 CE&LRMS problem, plus a large unscheduled downtime due to emergency security shutdown -JRS
TASK 67/68 unscheduled network reconfiguration and CE backup restoration -JRS
GermanySwitz.
GSI-LCG2 69/69 unscheduled power-cut in main computing centre, plus service reconfiguration
GoeGrid 71/73 Hardware defect of network interface, plus scheduled maintenance
SWITCH 52/52 We indeed had various problems in September which we could only address a month ago. However, the other sites did NOT update in a timely way to the June IGTF release, which contained the new QuoVadis Grid CA, resulting in our site not being trusted for two months or so.
TUDresden-ZIH 74/74 There were several reasons. One was a sudden power failure at the university campus. We also had some network problems which made the SAM tests fail from time to time (and without any known reason)
UNI-SIEGEN-HEP 69/69 Debugging CE problems with main admin absent -JRS
Italy
GRISU-CYBERSAR-PORTOCONTE 74/74 site back in production on 15th Oct after having changed domain and renamed the nodes
GRISU-SPACI-NAPOLI 72/74 scheduled m/w update plus network problem
INFN-CATANIA 73/73 unscheduled and difficult upgrade of DPM database schema
INFN-NAPOLI 69/69 Network problem (switch serving WNs broken) + queue misconfiguration (no slots reserved for certification jobs)
INFN-TORINO 57/86 scheduled downtime for site upgrade
SISSA-Trieste 72/72 lcg-cr random errors. Investigations in progress
NorthernEurope
BEgrid-KULeuven 64/64 Bitten by BDII upgrade problem (6 days) -JRS
ITPA-LCG2 70/70 expiration of the host certificates
NDGF-T1 72/72 Upgrade of dCache version & subsequent SRM and SAM test problems, TSM/dCache configuration problems -JRS
NO-NORGRID-T2 67/67 SAM tests fail because of SRM problems after dCache upgrade -JRS
T2_Estonia 65/65 Overloaded CE causing unstable cluster, SAM tests failing due to missing SE name while checking free space -JRS
UNIGE-DPNC 58/58 6d unscheduled Security maintenance. "Don't know yet how long they will be out of action" -JRS
Russia
BY-NCPHEP 34/34 Site appeared during the month, plus unscheduled SL5 + gLite 3.2 installation - JRS
RU-Novosibirsk-BINP 0/0 Site appeared on 15 Oct, but hasn't passed any tests since -JRS
RU-Phys-SPbSU 40/40 DomainName registration(delegation) problem -JRS
Ru-Troitsk-INR-LCG2 62/98 OK, downtime declared correctly for SL5 migration - JRS
SouthEasternEurope
BG02-IM 42/42 "Hardware tests and repair" -JRS
IL-TAU-HEP 74/74 The site is new at the production level and has had to deal with a few small problems plus a major electrical problem. Because of this, they have this week moved the site to a more reliable/safer place
RO-08-UVT 69/69 The site suffered from a problem with the BDII LDAP service during October (from 20.10 to 31.10)
RO-09-UTCN 43/43 Unscheduled DNS failure, root exploit fix, CRL update failures, yaim configuration stalled while configuring the CE
RO-14-ITIM 37/37 Storage element down (19d, unscheduled) -JRS - The site suffered from numerous problems during the reporting period: 1. kernel update problem; 2. hardware problem in the storage element (the system had to be reinstalled); 3. network problems (hardware problem with the main router of the site's ISP) - the network problem persists and they are working on it
RO-15-NIPNE 62/62
TR-01-ULAKBIM 22/22 Unscheduled CE, SE and WN upgrade -JRS. After Update 57 for gLite 3.1, the top-level BDII and the sites' sBDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update. The top-level BDII has been stable since a fresh installation and configuration of the bdii packages. In addition, the TR-01-ULAKBIM site has been in maintenance for nearly 2 weeks: it was upgraded with new hardware for the CE and WNs, and also migrated to gLite 3.2 with SL5.3. Since 28/10/2009 the TR-01-ULAKBIM site has been working with 1200 worker nodes with IB interconnect.
TR-03-METU 62/62 After Update 57 for gLite 3.1, the top-level BDII and the sites' sBDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
TR-04-ERCIYES 61/61 Unscheduled upgrade of CE and SE -JRS. After Update 57 for gLite 3.1, the top-level BDII and the sites' sBDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
TR-07-PAMUKKALE 68/68 After Update 57 for gLite 3.1, the top-level BDII and the sites' sBDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
TR-09-ITU 71/71 Unscheduled downtime in CE, possible hardware problem in the server -JRS. After Update 57 for gLite 3.1, the top-level BDII and the sites' sBDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
TR-10-ULAKBIM 67/67 After Update 57 for gLite 3.1, the top-level BDII and the sites' sBDIIs were not stable until the rollback to Update 56; all TR-* sites were affected by the BDII update.
SouthWestEurope
MA-01-CNRST 53/53 Unscheduled installation of a new SE, plus unscheduled site update to solve a local root vulnerability -JRS
UB-LCG2 60/60 Serious but still unspecified network issues at the university hosting the Tier-2
UMinho-CP 61/61
UNICAN 63/63
UK/I
UKI-LT2-QMUL 66/66 Various hardware and software issues; NFS server down; unscheduled Lustre Filesystem upgrade -JRS
UKI-LT2-UCL-CENTRAL 69/69

September 2009

The September League Table is here.


Region A%/R% Reason
AsiaPacific
CN-BEIJING-PKU 0/0 Site has been suspended for consistent poor availability
HK-HKU-CC-01 51/51
JP-KEK-CRC-01 14/14
MY-UM-CRYSTAL 0/0
PAKGRID-LCG2 73/73 Hardware problem with PC running siteBDII
TH-HAII 74/74
TW-NCUHEP 3/3 Infrastructure upgrade, but with no scheduled downtime
TW-NTCU-HPC-01 73/73 System upgrade, downtime not scheduled
Taiwan-IPAS-LCG2 19/19 Power maintenance and site upgrade, but no scheduled downtime. Site has been suspended for consistent poor availability
VN-IFI-PPS 52/52
VN-IOIT-HN 43/43
CERN
Umontreal-LCG2 73/73 We had two problems this month. For the first problem, ticket GGUS-51387, we had no explanation for it; the second problem was due to srmv2 in our SE.
CentralEurope
ELTE 49/49 Host certificates expired & problems getting new ones (GGUS-51953)
FMPhI-UNIBA 69/69 Security vulnerability; site closed during fix (GGUS-51952)
PEARL-AMU 57/57 Maintenance works on SE, SQLite DB problems and problems with two WNs (GGUS-51951)
egee.fesb.hr 55/55 Problems with SE & installation of new WNs (GGUS-51950)
GermanySwitz.
SWITCH 5/5 Nodes reinstallation due to "unexpected behaviour"
UNI-SIEGEN-HEP 57/57 We had only 57% availability due to two downtimes. The first one was a power cut at our university in Siegen, and the second was a software update on our whole computer cluster this month. Some additional problems with kernel patches then caused the availability you have seen.
wuppertalprod 70/70 We had one "black hole" WN with a wrong time. ATLAS jobs were not matched to that node (so it basically only ran ops jobs); from the ATLAS point of view we were efficient, so we realised the problem too late. After fixing this, the problem went away.
Italy
CNR-ILC-PISA 55/55 Problem with the fetch-crl file; the related command wasn't referenced in the cron job itself (GGUS-51355).
CNR-PROD-PISA 51/51 several test jobs blocked the CE, causing failures with "proxy expired" (GGUS-51539)
ESA-ESRIN 54/99 scheduled downtime from 2009-08-10 to 2009-09-14 for: gLite update on CE, StoRM, LFC + migration of some gLite services to another server (the site remained in production because the initial duration of the downtime was less than a month)
INFN-FERRARA 44/44 problem related to the SE: it is on a virtual machine, and some settings have been changed (in particular, a dedicated network port), hoping it will be more stable
INFN-NAPOLI 57/57 unexpected blackout problems, involving also the UPS
INFN-NAPOLI-PAMELA 68/69 unexpected blackout problems, involving also the UPS
INFN-PERUGIA 45/92 scheduled downtime from 2009-07-02 to 2009-09-15 for the datacenter to change rooms. Site status "uncertified" during that period
NorthernEurope
BEgrid-KULeuven 51/51
BEgrid-ULB-VUB 74/74
CSC 57/57 disk system failure
EENet 41/41 CE in testing, unscheduled
HPC2N 64/64 New lcg-CE was not in-place until the 11th of September
HTC-BIGGRID 63/63
ITPA-LCG2 66/66 We were installing/configuring spektras.itpa.lt CE
KTU-BG-GLITE 71/71
Russia
RRC-KI 23/23 Upgrade to RHEL 5 and gLite 3.2, scheduled downtime declared
RU-SPbSU 61/61
ru-Chernogolovka-IPCP-LCG2 65/65
ru-PNPI 20/20
SouthEasternEurope
BG02-IM 73/73
GR-04-FORTH-ICS 1/1 StoRM configuration error, unscheduled downtime extended several times
GR-09-UoA 74/74 computer centre move
IL-BGU 21/21
TECHNION-HEP 64/64 Problem with the ntp daemon running at the site. It is fixed.
TR-04-ERCIYES 20/20 TR-04-ERCIYES had a Maradona error on job submission. The admins reconfigured the whole site, checked logs and followed the network traffic, but the SAM tests were not stable. The CE and SE were upgraded with new hardware and the site was reconfigured once more. Now the SAM tests are OK. We are waiting for things to be stable after the new upgrades.
SouthWestEurope
CFP-IST 58/58 Hardware failure, system admin is away
DI-UMinho 70/70
LIP-Coimbra 31/98 Scheduled cluster reconfiguration and upgrade
UB-LCG2 54/54 Unspecified network issues
UK/I
UKI-LT2-RHUL 58/58
UKI-LT2-UCL-CENTRAL 45/60 Scheduled upgrade of cluster filesystem

August 2009

The August League Table is here.


Region A%/R% Reason
AsiaPacific
CN-BEIJING-PKU 14/36
IN-DAE-VECC-EUINDIAGRID 11/11
JP-HIROSHIMA-WLCG 68/84
MY-MIMOS-GC-01 51/51
TH-HAII 58/58
TW-NCUHEP 54/54
Taiwan-IPAS-LCG2 26/26
VN-IOIT-KEYLAB 47/48
CentralEurope
BY-UIIP 69/69 The site administrator was blocking ports >55000 for incoming connections, which caused failures of the CE-host-cert-valid test.
ELTE 47/82 Problem with obtaining new host certificates for site caused by holiday period.
HEPHY-UIBK 68/95 The site was forced by the university to go offline because of some upgrades. Additionally, the site also had serious problems with its storage system.
PEARL-AMU 36/55 Low availability and reliability were caused by a problem with disk space.
France
ESRF 57/73
IN2P3-LPSC 54/85 Two weeks of SD was planned with the approval of the French ROC due to foreseen electrical work in the machine room. That then explains the poor monthly availability.
IPSL-IPGP-LCG2 51/77 One week of SD was planned with the approval of the French ROC due to the lack of staff during the vacations. The SD had to be extended to find and to apply the fixes linked with the security alerts.
GermanySwitz.
GoeGrid 68/70 Unscheduled downtime due to the emergency power outage of a burning transformer. In addition, we were facing problems with our site-BDII. Those problems occurred only from time to time, but they made the service unstable and lowered our overall availability.
SWITCH 40/40 As for the performance of the site in July and August this was mostly due to the fact that the installation of the June EUGridPMA bundle (and later versions) was delayed consistently at the partner sites: our CE certificate is in fact a new SWITCHpki grid server certificate (issued by QuoVadis) which was installed last June and for quite some time the SAM tests were failing because the QuoVadis CA was not recognized by the various central grid nodes such as the WMS etc.
UNI-SIEGEN-HEP 52/81 We had only 52% availability due to unscheduled downtime. Our main administrator has left, so we need some time to solve problems at the moment. In August we had downtime due to the security incident related to some kernel versions.
Italy
ESA-ESRIN 29/93 scheduled downtime for gLite update on CE, StoRM, LFC + migration of some gLite services to another server.
INFN-BARI 74/74 In August, while the availability of the INFN-BARI site was slightly above the threshold, the reliability was 74%, which is in fact below the threshold, but only slightly. This low reliability was due to two reasons: 1) In the first days of August we were still recovering from the chiller fault that occurred on the 28th of July. 2) During the two central weeks of August, the Physics Department in Bari was closed and the personnel were on vacation. In such conditions the farm monitoring and management could only be done remotely and with reduced manpower. This caused some delays in the discovery of faulty components and in their restarting/replacement.
INFN-CAGLIARI 67/67
INFN-FERRARA 51/51 the site has been experiencing intermittent problems with the SAM tests for quite some time. We ran extensive tests on the RAM (even though the machine uses ECC RAM) but ruled it out. What puzzles us is that sometimes the test flips from OK to not OK (or vice versa) even if we don't touch the system. We are checking the different logs and have already tried a few things, but have not found the solution yet. By the way, I checked with a colleague working for the LHCb experiment, and their tests have not spotted a single glitch in the last month (i.e. we are fully certified to run their jobs). I did this crosscheck just because I know one of the people there: I know it is not an extensive test, but the result still puzzles me.
INFN-LNS 21/46 From 8 till 23 August, scheduled downtime for summer holidays. From the 24th, general malfunction on site due to UPS problems. Only one person was available, which meant a lot of time to fix the open tickets.
INFN-NAPOLI 13/99 hardware problems and some updates missing caused a lot of SAM failures
INFN-NAPOLI-ARGO 51/53 some missing updates (CAs) due to the holiday period
INFN-NAPOLI-PAMELA 34/95 farm unattended and in downtime for almost all the month
NorthernEurope
BEgrid-UGent 65/65 No reply from site despite several mails and a GGUS ticket. Site suspended 28/9/09 by NE ROC
CSC 67/83 The egee-ce.csc.fi front end was down mainly due to unforeseen disk failures on the Murska cluster. Actually, the front end itself was OK; only the WNs and the shared filesystem were affected
HPC2N 1/1 We are still trying to set up a new configuration for a new cluster. We got an SE (DPM) up on the 25th of August, but we are still having problems with the CE. It is our hope that we can get it up before the 14th of September.
ITPA-LCG2 53/91 We have been using a 'sdj' queue for the SAM tests for a long time, but for some reason a WMS routing the SAM jobs from CERN stopped sending them to the queues named 'sdj'. I have indicated this in the corresponding GGUS ticket assigned to me. Although the CERN SAM tests were failing, our site was fully functioning, as could be seen at BalticGrid's local SAM tests website, sam.mif.vu.lt
PDC 63/69 The top BDII failed a couple of times, unfortunately over weekends
Russia
BY-NCPHEP 0/N.A.
Ru-Troitsk-INR-LCG2 71/72
ru-PNPI 61/61
SouthEasternEurope
HG-05-FORTH 71/73 The top-BDII used by the site is mon01.ariagni.hellasgrid.gr; the site failures were caused by problems that occurred on this top-BDII. The site admins informed us that they have changed the top-BDII used by the site to the bdii.core.hellasgrid.gr round-robin mechanism we maintain in the HellasGrid infrastructure.
RO-08-UVT 47/93
RO-13-ISS 53/74 1. From 03.08.2009 to 13.08.2009 the main power cable was pierced - no electricity. The electrical network and power capacity have been upgraded (see link). 2. From 15.08.2009 to 20.08.2009 there was a very bad network connection from the ISP (the principal optical cable was damaged). The site had to run on a backup connection and the speed was very low - SAM tests failed.
TECHNION-HEP 73/73
SouthWestEurope
ESA-ESAC 58/58
LIP-Coimbra 0/N.A.
UAM-LCG2 67/94
UPV-GRyCAP 17/19
e-ca-iaa 54/54
UK/I
UKI-LT2-RHUL 0/0 The site BDII was marked as being in downtime for the whole of this month; the site was otherwise operational.

July 2009

The July League Table is here.


Region A%/R% Reason
AsiaPacific
Australia-ATLAS 37/37 Site was testing a PPS version of the BDII which was incompatible with the GStat tests. This resulted in the low site availability for July
CN-BEIJING-PKU 2/2
IN-DAE-VECC-EUINDIAGRID 1/1
JP-HIROSHIMA-WLCG 52/77
NCP-LCG2 61/97
TH-HAII 0/0
TW-NTCU-HPC-01 59/59
Taiwan-IPAS-LCG2 10/10
CERN
UFRJ-IF 57/59 The problem is linked to the network issue the site experienced in June since the site internal monitoring system gives 99.595% of uptime and the site non-EGEE regional nagios says 37.55%. In July, the site also discovered a glitch on the monitoring system related to the fact that the site non-EGEE regional nagios uses hard coded IP addresses and the site changed from one network to another. So, the site would expect an availability of 68% and the EGEE monitoring system measured 57%. The site believes the difference is due to a different methodology.
France
IPSL-IPGP-LCG2 52/96 This is a small site mainly supporting ESR with only one site administrator. In accordance with the main VO and the ROC, the site was put into downtime during the site administrator's vacations.
GermanySwitz.
LRZ-LMU 67/67 LRZ-LMU had a bad SRM failure, a corrupt system disk and a corrupt pnfs DB. This took up to 10 days to repair (from 5th July), hence the 67% availability in July.
SWITCH 68/68 As for the performance of the site in July and August this was mostly due to the fact that the installation of the June EUGridPMA bundle (and later versions) was delayed consistently at the partner sites: our CE certificate is in fact a new SWITCHpki grid server certificate (issued by QuoVadis) which was installed last June and for quite some time the SAM tests were failing because the QuoVadis CA was not recognized by the various central grid nodes such as the WMS etc.
Italy
CNR-ILC-PISA 66/92 two unscheduled downtimes due to electrical problems, quickly announced on GOC-DB (as attested by the reliability value). After the second electrical problem the ntp daemon didn't start, and we found out that an update performed in the previous days had changed the SELinux policies, so that by default ntpd was blocked
CYBERSAR-CAGLIARI 16/37
CYBERSAR-PORTOCONTE 55/69 site entered production on Jul 27th; the replica test on the CE failed due to a power cut of the BDII (published by the CYBERSAR-CAGLIARI site) set on the WNs
INFN-NAPOLI 46/51
INFN-NAPOLI-PAMELA 57/63 several power supply problems caused unexpected power cut
SPACI-LECCE 39/50 hardware problem on SE
NorthernEurope
ITPA-LCG2 5/28 First we had suspicious activity which might have been a security incident, but we had no further proof of it. To be on the safe side we decided to reinstall our stack from scratch with the latest software. Later we had a hardware failure on the server hosting the virtual machines.
KTU-BG-GLITE 62/62 GOCDB was unusable at times, so I was unable to properly register downtimes. Our site was functional, except for the accounting data not being published; in terms of availability it was providing services to users most of the time. The system administrator was also on vacation during this period.
KTU-ELEN-LCG2 73/73 That's because of vacation periods. Increased network load (new user jobs, our SE used to keep job results from different clusters, APEL instability, BDII timeouts) may also have contributed to the negative metrics.
Russia
RRC-KI 67/80
RU-SPbSU 57/64
ru-PNPI 68/69
SouthEasternEurope
GR-07-UOI-HEPLAB 51/92 CE node was down due to power supply failure, from 2009-07-17, 12:00:00 [UTC] to 2009-08-02, 14:07:00 [UTC] (see relative link)
RO-15-NIPNE 71/74 Cooling system failures (see link and link) and also an unexpected problem with the electrical power from 25/7/09 - 28/7/09.
TR-05-BOUN 59/75 The TR-05-BOUN site was moved from the South Campus to the Kandilli Campus of Bogazici University; as a result it was in maintenance at the beginning of July (see relative link).
WEIZMANN-LCG2 69/69
SouthWestEurope
ESA-ESAC 46/46
IEETA 67/77
LIP-Coimbra 53/97
MA-01-CNRST 43/43
NCG-INGRID-PT 26/92
e-ca-iaa 53/69
UK/I
UKI-LT2-IC-HEP 69/81
UKI-LT2-QMUL 71/71
UKI-LT2-UCL-CENTRAL 5/33

June 2009

The June League Table is here.


Region A%/R% Reason
AsiaPacific  
Australia-ATLAS 0/0 Because of network latency, site was testing a version of the BDII which was incompatible with the GStat tests. This resulted in zero availability for June, even though the site was OK
CN-BEIJING 1/1
HK-HKU-CC01 55/55
IN-DAE-VECC-EUINDIAGRID 36/36
JP-KEK-CRC-01 0/0
KR-KISTI-HEP 23/27
MY-UPM-BIRUNI-01 4/4
PAKGRID-LCG2 49/78
TH-HAII 37/37
Taiwan-IPAS-LCG2 0/N.A.
CERN  
ALBERTA-LCG2 55/55
UFRJ-IF 69/76 The site is having network problems. The site Nagios shows good results. They are negotiating a better connection
Uniandes 52/59
CentralEurope  
BY-UIIP 34/52 The main site admin was on holiday and the backup admin didn't answer. During that time a problem with missing libraries on the WNs appeared, as well as an SE misconfiguration. In summary: a procedure for delegating the site admin's administrator rights hasn't been developed and practiced at UIIP. The fault is on the main site administrator.
egee.irb.hr 44/81 Site has repeated problems with cooling system
France  
GermanySwitz.  
UNI-FREIBURG 66/66 In the first two weeks of June we had severe cooling problems resulting from a malfunctioning climate control in our computing centre. In addition we suffered power cuts that resulted in total failures of our hardware. In the meantime we have overcome these technical problems and are running stably.
Italy  
CNR-ILC-PISA 46/82 After the application of Updates 45 and 46, our SE (DPM) had a dependency problem with the glite-info-dynamic-dpm library. For some days the SE didn't publish the GlueSAPath variable and the GlueSEUniqueID tree. After ten days of attempts with the help of IT CMT, we reinstalled the SE from scratch, and only once other problems with some x86_64 packages were solved did the SE respond well to the SAM tests
CYBERSAR-CAGLIARI 69/83
INFN-CATANIA 69/69 The globus-job-manager-marshal service doesn't delete unused files under /opt/globus/tmp/gram_job_state/. Up to 20000 files, job submission is OK; over that number it starts to have problems. I delete a lot of files by hand every 2 days, but that's not a good solution for a site manager. The development team released 3-4 versions of the globus-gma rpm, but the problem is the same. Our availability and reliability were under 70% only during the first week of June because, starting from the second one, I fixed it by hand.
INFN-GENOVA 65/65
INFN-MILANO 71/71 The main source of errors in the SAM tests in the month of June is the actual Storage Element, which is a DPM SRM storage element. As a short-term solution we are going to upgrade the DPM server hardware in the next days; as a mid-term solution we are going to replace the DPM-based SE with a StoRM-based SE
INFN-NAPOLI 62/64 problems with the CE file system due to the /opt/exp_soft directory, finally moved to a new disk. The solution wasn't rapid due to poor manpower
INFN-PARMA 47/72 On June 17 we got an authentication problem on our Storage Element (StoRM v1.3). The problem arose only for users belonging to the sgmops group. The effort to solve the problem was unsuccessful, so we had to install the server from scratch with the newer version, StoRM 1.4.
INFN-ROMA1-CMS 68/72 In June 2009 INFN-ROMA1-CMS hoped to have reached a good level of stability, but this proved not true due to continuous flips of the CE. So we took the decision to reinstall the middleware on the CE. This process was long and painful: the lack of complete, up-to-date and understandable documentation, specifically on the variables in site-info.def, made us lose several days of availability, despite help from our ROC. So, between this process and previous CE glitches, we lost almost 10 days in June. Now the CE is fully reinstalled and we hope to have all components under control.
INFN-ROMA1-VIRGO 53/70 A failure happened on the CE services and the site manager could not intervene at once; hence the site remained ~10 days in unscheduled downtime.
SPACI-LECCE 35/56 Hardware problems on the Storage Element; the hardware was changed, and also the IP and hostname. Some network problems need further investigation.
UNI-PERUGIA 39/55 Configuration problems related to the SE machine (se.grid.unipg.it) affected the whole UNIPG site (ticket starting date 2009/6/16 10:58:32). The sgmops user Judit Novak was not able to authenticate on the machines belonging to the UNIPG site. The problem was solved (ticket closing date 2009/7/1 9:17:03) by adding more sgmops pool accounts. Due to the unpredictable nature of the problem we were not able to set a scheduled downtime.
NorthernEurope  
ITPA-LCG2 54/54 We had a security incident, and now are upgrading/reinstalling the servers. The site is in unscheduled downtime now.
PHILIPS-TGRID 68/68 We used to add all the VOs we support (on HTC-BIGGRID) to PHILIPS-TGRID as well. This meant that jobs (that did not have many requirements) could also be queued on PHILIPS-TGRID. Some VOs solve this by adding their own software tags to classify a site as 'OK' to run jobs, but not all VOs do this. In the end we removed all non-infra VOs from PHILIPS-TGRID. Also, pbs/tmpdir was not ideally configured on our WNs, so it filled up very fast; we reconfigured the WNs. Finally, we did not give priority to solving problems on PHILIPS-TGRID (because it is a test site).
VGTU-gLite 59/80 VGTU-gLite was in maintenance status for a long time in June. After it was reinstalled with a new OS and started to receive the SAM tests, one of the WNs was randomly failing because of I/O errors on its HDD, and it was hard to detect such a problem as we didn't have a hard-disk monitoring tool in place yet. That WN was removed from the site until the HDD is replaced with a new one. We now have monitoring for all WN hard disks using the 'smartmontools' tool, and we run various tests once per week to prevent such problems appearing in the future.
Russia  
RRC-KI 69/72 1. There was a problem with the air-conditioners; as a result there was an unscheduled downtime involving the SE. 2. Because of the STEP'09 ATLAS stress test, some tuning of the CE and SE was done. 3. The site's network connection to Europe was down for a few days.
IPCP-LCG2 57/57
ru-PNPI 65/66 This site actively participated in the STEP'09 stress test, in particular the ATLAS part. Unfortunately, a bottleneck in the internal network led to a site crash during the test. The problem was investigated and the local network is under modification.
SouthEasternEurope  
GR-05-DEMOKRITOS 35/84 Several problems with power supplies carried over from previous months. It seems that the power supply problems don't exist any more. We bought a new server to act as a cluster controller and we are in the phase of reconfiguring the site.
RO-07-NIPNE 60/62 We had a problem with our cooling system during this month, so the cluster had to be stopped very often
TR-05-BOUN 63/68 The DNS server of Bogazici University was changed and had problems with reverse DNS records of TR-05-BOUN. After that problem, TR-05-BOUN was in maintenance. The site had been moved from the South Campus to Kandilli Campus of Bogazici University. At the same time, its CE and SE had been upgraded
SouthWestEurope  
UNICAN 65/65
UK/I  
UKI-LT2-UCL-CENTRAL 13/18

May 2009

The May League Table is here.


Region A%/R% Reason
AsiaPacific  
Australia-ATLAS 0/0 Because of network latency, site was testing a version of the BDII which was incompatible with the GStat tests. This resulted in zero availability for May, even though the site was OK
CN-BEIJING-PKU 7/9
HK-HKU-CC-01 54/54
IN-DAE-VECC-01 20/21
JP-KEK-CRC-01 43/43
KR-KISTI-HEP 61/69
MY-UPM-BIRUNI-01 0/0
NCP-LCG2 35/59
TW-NCUHEP 66/66
Taiwan-IPAS-LCG2 0/N.A.
CERN  
TORONTO-LCG2 55/55 seems to be getting back into shape, and I have checked that they are fine for June; the May figure was in any case on an upward trend, and I know they had dCache issues.
UFRJ-IF 53/61 seems to have had an issue with the site BDII. I have sent an e-mail, as you know, but they look much better in June
CentralEurope  
PEARL-AMU 68/68
egee.irb.hr 18/38
France  
GermanySwitz.
GSI-LCG2 68/68 The system is Debian Linux, which was not completely supported by the gLite middleware. No detailed news about improvements, nor any comment for the month related to A/R monitoring
MPI-K 62/78 Due to the problematic tickets 47872, 47920 and 47952 opened last month, which were closed as SOLVED in May. There have been no problems with the site in the last 30 days according to SAM.
TUDresden-ZIH 17/32 May was the month we set up our site. Our site was set to production state though we still had problems in the initial setup. The problems took us some time to fix.
UNI-BONN 68/88 We were in an unscheduled downtime from 2009-04-20 at 8:00 to 2009-05-17 at 8:16 to increase the storage capacity and install a new dCache. There were two extensions after the initially planned 4-day intervention, since we had some problems with the new dCache installation. This was finally solved. We also had SAM tests failing on a weekend later in the month due to some problems with space-token configurations needed for the ATLAS VO, which affected the storage globally.
Wuppertalprod 16/17 We upgraded our dCache instance and lost the complete pnfs database in doing so. Support was very limited during that time. dCache.org people tried several times to save the database; in the end we had to give up and install from scratch, with all data lost
Italy  
CNR-ILC-PISA 64/89
CYBERSAR-CAGLIARI 42/60 problem with globus-gma processes
INFN-BARI 58/75
INFN-FERRARA 62/68 back in production since May 8th, after a reinstallation of the farm and a change of supporters; tests failed due to configuration problems
INFN-MILANO-ATLASC 69/70 new site in production from May 18th: there were authentication failures relating to ops users, fixed after several days of hard debugging
INFN-NAPOLI-PAMELA 66/80
INFN-ROMA1-CMS 65/65 The site is now fully operational. However, in May we had two problems. One was related to trying to disable an unsupported VO, which resulted in a major misconfiguration on the CE; this took away 4 days, happening just before a long weekend. The other was related to a misconfiguration on the SE, which again took about 4 days to resolve
INFN-ROMA3 41/46 The site had some problems with storage during the month of May. At the beginning of the month we added some disk to a couple of volumes and we had to restripe them for proper balancing (we are using GPFS): this had an impact on our Storm SE, which started timing-out, so we decided to declare an unscheduled downtime while the operation was ongoing. Around the 20th of the month, the addition of a new disk-server broke GPFS for a few hours. Throughout the month, there have been intermittent problems, mainly due to the SE. For the above reasons, we decided to reinstall our Storm SE, and this happened during the scheduled downtime around 8th June. Since then, the site is performing well.
INFN-TRIESTE 25/45 problems with globus-gma and on the storage element (STORM)
SISSA-TRIESTE 64/64 Problem with globus-gma and 2 unscheduled power blackouts.
SNS-PISA 60/64 The availability problems of the SNS-PISA Grid node have been related to large job submissions from some VOs in the past months, with an overload of the CE and BDII (running on the same hardware). The problem seems to have been solved by introducing a maximum queueable-job limit via PBS three weeks ago.
SPACI-LECCE 57/74 unexpected hardware failure on the storage element; it has been moved to another machine
NorthernEurope  
VGTU-gLite 44/44 At this moment VGTU-gLite is in downtime. I think we have had recurring problems over the last half year because of the SLC3 OS (long unmaintained, plus its middleware), which we run for the CE/SE. The system is now being reinstalled and we'll bring it up as soon as possible with a fresh install of the gLite middleware and the new version of SLC4.
Russia  
JINR-LCG2 39/94 JINR is one of the main sites in Russia. It is doing some reconstruction of network facilities without downtime. Unfortunately, the reconstruction required more time than expected. I hope that JINR will stabilise in June.
Kharkov-kipt-lcg2 53/83 The Kharkov site has some troubles with network connectivity to Europe. Because this site is in Ukraine, it is not possible for us to help them at this point
SouthEasternEurope  
CY-01-KIMON 63/63 The reason our site had a low availability was the same as the previous month: the site-BDII. As I wrote in my previous report, we had the lcg-CE, TORQUE server, TORQUE utils and site-BDII installed on the same machine. The problem appeared by the end of April, when we supported a new VO (lhcb), and we had it until the first week of May, when we moved the site-BDII to a new machine. Since then this problem has disappeared and the site-BDII is running successfully. We will do our best for our site to meet the specified criteria.
GR-05-DEMOKRITOS 7/99 Hardware problems mainly with power supplies
MK-01-UKIM_II 63/65
RO-13-ISS 71/74 RO-13-ISS had connectivity problems and UPS failures, which affected the site, as power-offs were frequent. It still has to have the vendor replace the UPS batteries with new ones, so there could be some problems in the following month too. But the site is registering this downtime in GOCDB, so reliability should be higher.
RO-15-NIPNE 67/69 RO-15-NIPNE had a problem with a specific LHCb software installation using SLC5 and gcc 4.3, as presented here. It seems that what was presented there was not functional for them, so they were failing more tests, but they have now reverted to SLC4.7/gcc 3.4, and the site is functional.
SouthWestEurope  
BIFI 67/85
ESA-ESAC 68/68
UB-LCG2 32/35
UPV-GRyCAP 48/49
e-ca-iaa 62/62
UK/I  
UKI-LT2-UCL-CENTRAL 23/25 1. Lustre file system slowdown - should now be fixed. 2. CE "funnies" led to CRLs not downloading reliably; also peculiar behaviour at the shell prompt; a reboot fixes things for a while. 3. Proxy timeouts caused by a very full cluster - we have restricted ops jobs to 15 mins so they backfill, and boosted their priority. 4. The CE appears to be overstretched, which causes 2 problems: a) the OOM killer kicking in and killing things, b) downloading of the payload from the CE is very slow, causing SAM tests to time out (usually during the CAVER test). The main underlying problem is an overloaded CE which is about to be upgraded (this fixes 2 and 4).


-- JohnShade - 09 Jul 2009
