WLCG-OSG-EGEE Ops' Minutes Mon 26 Oct 2009

Summary

  • gLite release 3.1 UPDATE 57 was rolled back last Friday. UPDATE 58 will be available during the week to replace it.
  • A reminder for sites to move to WMS 3.2 (available in gLite repository). This must be done by the end of October! The list of WMS to upgrade is attached to the agenda and available here.
  • The RB service was made obsolete quite some time ago and should no longer be run in the production infrastructure. Please decommission the RB and install an up-to-date WMS as soon as possible.

Attendance

EGEE

  • Asia Pacific ROC: Jason Shih
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Maite Barroso, Diana Bosio
  • French ROC: Rolf Rumler
  • German/Swiss ROC: Angela Poschlad, Wen Mei
  • Italian ROC: Alessandro Paolini, Paolo Veronesi
  • Latin American ROC: Luciano Diaz, Andres Holguin Coral
  • Northern Europe ROC: Michaela Lechner, Gert Svensson
  • Russian ROC: Lev Shamardin
  • South East Europe ROC: Marios Chatzangelou
  • South West Europe ROC: Christian Neissner
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Helmut Dres
  • GOCDB: Absent

WLCG management

Herry Ranshall, Markus Schulz

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

  c-COD Team
From ROC NE
To ROC IT

  • Report from cCOD:
There was a problem with the report: the handover log was filled in but not sent. It was resent 20 minutes before the meeting. There are no important points to raise, only 3 expired tickets for ROC CERN. ROC CERN had already seen them earlier in the day and acted on them.

Sites Considered For Suspension

None

PPS Reports and Issues

  • Nothing to report

gLite Release News

  • gLite 3.1 UPDATE 57 was released last Thursday. SAM failures were reported immediately: 30 CEs were not published correctly in the BDII, and there was also a top-level BDII problem, as the BDII failed to publish sites that were themselves publishing correctly. A roll back was announced on Friday.

  • Problem 1 was in the conditional restart of the service, so it affected only sites that were doing auto-update.
  • Problem 2 was caused by a missing rpm on SL4 that was available only for SL5. It was not detected in certification or in PPS: the scenario was not considered in certification, and the problem was masked in pre-deployment tests because the rpm was present on the PPS machine.

Markus: the first problem could not be addressed by a roll back, as the mistake is in the logic that restarts the BDII service, but auto-update is not a scenario that should be considered.

John Shade: if the problem affected 30 sites, and possibly more since we had a peak of 64 CE-sft-job test failures, maybe we should make sure that auto-update does not break things if sites decide to use it.

Markus: the problems are essentially due to the fact that we are maintaining 2 code branches by hand.

Antonio: this once again highlights the importance of the roll out testing.

Maite: can we do anything in the development stage to avoid these problems?

Markus: the transition from one version to the next takes a lot of time, and this is the cause of the problem: we have to back port some of the features introduced in the new versions.

Antonio: problem 2 was known, as a patch had been rejected previously. So it would be good if any known issue were automatically taken into consideration by the certification team.

Jeremy: why was the roll back request sent on Friday?

Maite: roll back was decided in 30 mins. The release came out on Thursday, but we did not spot the issue before Friday lunch and acted swiftly after that.

Antonio: UPDATE 58 will go out soon, containing everything that was in UPDATE 57 except the BDII update, which is undergoing further testing.

EGEE Items From ROC Reports

  • 4 reports not validated: AP, France, Russia, Germany/Switzerland.
  • Points from ROC-IT:
    • Which version of each Storage Element implementation is compliant with the "Usage of Glue Schema v1.3 for WLCG Installed Capacity information"? As ROC, we could push and follow the upgrade of the old versions and validate the published data.
    • The Baseline versions of services and client tools for WLCG operations (https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions) seem to have been last updated on 02-Jun-2009. This useful page should be updated more frequently (at every gLite update?) to be sure that the recommendations are not out of date.

Maite will check the SE version and put it in the minutes.

Grid Service Interventions

  • Consult links on the agenda page.

Misc

  • Reminder for sites to move to WMS 3.2 (available in gLite repository). This must be done by the end of October! The list of WMS to upgrade is attached to the agenda and available here.
  • The RB service was made obsolete quite some time ago and should no longer be run in the production infrastructure. Please decommission the RB and install an up-to-date WMS as soon as possible. Seven RBs are still around: GILDA-INFN-CATANIA, HG-01-GRNET, IFCA-LCG2, IFIC-LCG2, JP-KEK-CRC-01, RRC-KI, TR-01-ULAKBIM.

Mario (SEE): there are 2 RBs in SEE. The Turkish one will announce its end of life soon. HG-01-GRNET will remain operational for the next month for a specific usage reason. I have sent an e-mail to Nick explaining the reasons.

  • One question on down times longer than 1 month.
Maite: there is no automatic procedure, but a warning e-mail is sent to the ROC to discuss the closure of the site.

Newly Created Action Items

Assigned to Due date Description State Closed Notify  

Review of Open Action Items

One action on SAM MPI tests: they are in validation, and the new MPI working group (John Walsh and Isabel Campos) will open tickets for the sites. The action can be closed; SAM is working with John Walsh.

Maite: please announce it to the OPS meeting before the tests are put in production.

Open Action Items

Id  Submitter  Description  Creation  Due  Assigned To

Actions Closed in Last 20 Days

Id  Submitter  Description  Creation  Due  Assigned To  Closed

AOB

  • On Friday the first lead ions will pass through the ALICE detector and on Sunday the first protons will pass through the LHCb detector.

Next Meeting

The next meeting will be Monday, 02 Nov 2009 14:00 UTC (16:00 Swiss local time).

  • Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 14:00 UTC (16:00 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0148141


Topic revision: r3 - 2009-10-28 - unknown