gLite release 3.1 UPDATE 57 was rolled back last Friday. UPDATE 58 will be available during the week to replace it.
A reminder for sites to move to WMS 3.2 (available in gLite repository). This must be done by the end of October! The list of WMS to upgrade is attached to the agenda and available here.
The RB service was made obsolete quite some time ago and should no longer be run in the production infrastructure. Please decommission the RB and install an up-to-date WMS as soon as possible.
Attendance
EGEE
Asia Pacific ROC: Jason Shih
Central Europe ROC: Malgorzata Krakowian
OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Maite Barroso, Diana Bosio
French ROC: Rolf Rumler
German/Swiss ROC: Angela Poschlad, Wen Mei
Italian ROC: Alessandro Paolini, Paolo Veronesi
Latin American ROC: Luciano Diaz, Andres Holguin Coral
Northern Europe ROC: Michaela Lechner, Gert Svensson
Russian ROC: Lev Shamardin
South East Europe ROC: Marios Chatzangelou
South West Europe ROC: Christian Neissner
UK/Ireland ROC: Jeremy Coles
GGUS: Helmut Dres
GOCDB: Absent
WLCG management
Harry Renshall, Markus Schulz
Feedback on Last Week's Minutes
None was given.
EGEE Items
Grid Operator Hand Over on Duty
c-COD Team: from ROC NE to ROC IT
Report from cCOD:
There was a problem with the report: the handover log was filled in, but it was not sent. It was resent 20 minutes before the meeting. Basically, there are no important points to raise, only 3 expired tickets for ROC CERN. ROC CERN had already seen them earlier in the day and acted on them.
gLite 3.1 UPDATE 57 was released last Thursday. SAM failures immediately showed 30 CEs not published correctly in the BDII. There was also a top-level BDII problem: the top-level BDII failed to publish sites that were themselves publishing correctly. A rollback was announced on Friday.
Problem 1 was in the conditional restart of the service, so it affected only sites that were doing auto-update.
Problem 2 was caused by a missing RPM on SL4 that was available only for SL5. It was not detected in certification or in PPS: the scenario was not considered in certification, and the problem was masked in pre-deployment tests because the RPM was already on the PPS machine.
Markus: the first problem could not be addressed by a rollback, as the mistake is in the logic to restart the BDII service, but auto-update is not a scenario that should be considered.
John Shade: if the problem affected 30 sites, and possibly more as we had a peak of 64 CE-sft-job test failures, maybe we should make sure that auto-update does not break things if sites decide to use it.
Markus: the problems are essentially due to the fact that we are maintaining 2 code branches by hand.
Antonio: this once again highlights the importance of roll-out testing.
Maite: can we do anything in the development stage to avoid these problems?
Markus: the transition from one version to the next takes a lot of time, and this is the cause of the problem: we have to backport some of the features introduced in the new versions.
Antonio: problem 2 was known, as a patch had been rejected previously. So it would be good if any known issue were automatically taken into consideration by the certification team.
Jeremy: why was the rollback request sent on Friday?
Maite: the rollback was decided within 30 minutes. The release came out on Thursday, but we did not spot the issue before Friday lunch and acted swiftly after that.
Antonio: UPDATE 58 will go out soon, containing everything that was in UPDATE 57 except the BDII update, which is undergoing further testing.
EGEE Items From ROC Reports
4 reports not validated: AP, France, Russia, Germany/Switzerland.
Points from ROC-IT:
Which version of each Storage Element implementation is compliant with the "Usage of Glue Schema v1.3 for WLCG Installed Capacity information"? As a ROC, we could push and follow the upgrade of old versions and validate the published data.
The Baseline versions of services and client tools for WLCG operations (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions) seem to have been last updated on 02-Jun-2009. This useful page should be updated more frequently (at every gLite update?) to make sure that the recommendations are not out of date.
Maite will check the SE version and put it in the minutes.
Grid Service Interventions
Consult links on the agenda page.
Misc
Reminder for sites to move to WMS 3.2 (available in gLite repository). This must be done by the end of October! The list of WMS to upgrade is attached to the agenda and available here.
The RB service was made obsolete quite some time ago and should no longer be run in the production infrastructure. Please decommission the RB and install an up-to-date WMS as soon as possible. Seven are still around: GILDA-INFN-CATANIA, HG-01-GRNET, IFCA-LCG2, IFIC-LCG2, JP-KEK-CRC-01, RRC-KI, TR-01-ULAKBIM.
Marios (SEE): there are 2 RBs in SEE. The Turkish one will announce its end of life soon. The HG-01-GRNET one will remain operational for the next month for a specific usage reason. I have sent an e-mail to Nick explaining the reasons.
One question on down times longer than 1 month.
Maite: no automatic procedure, but a warning e-mail to the ROC to discuss the closure of the site.
One action on SAM MPI tests: they are in validation, and the new MPI working group (John Walsh and Isabel Campos) will open tickets for the sites. The action can be closed; SAM is working with John Walsh.
Maite: please announce it to the OPS meeting before the tests are put in production.
Open Action Items
Id    Submitter    Description    Creation    Due    Assigned To
Actions Closed in Last 20 Days
Id    Submitter    Description    Creation    Due    Assigned To    Closed
AOB
On Friday the first lead ions will pass through the ALICE detector and on Sunday the first protons will pass through the LHCb detector.
Next Meeting
The next meeting will be Monday, 02 Nov 2009 14:00 UTC (16:00 Swiss local time).
Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
The meeting will start promptly at 14:00 UTC (16:00 Swiss local time).