WLCG-OSG-EGEE Ops' Minutes Wed 10 Sep 2008

Summary

  • The feedback from most EGEE ROCs to SA3's proposal for central software distribution has not been favourable.
  • The last gLite upgrade, which introduced versions of GFAL and lcg-utils that were incompatible with BDII v3.0, negatively impacted the EGEE production grid.

Attendance

N.b. Please join the web conference (stating name(s) and affiliation) even if you dial-in by phone, so that the minute taker can see who's there! Sunrise and 0033478930880 are not very helpful indications! Otherwise, we'll have to revert to the time-consuming roll-call at the beginning of each conference!

EGEE

  • Asia Pacific ROC: Absent
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen, Diana Bosio, Maria Dimou
  • French ROC: Helene, Rolf, ?
  • German/Swiss ROC: Angela Poschlad
  • Italian ROC: Absent
  • Northern Europe ROC: Gert Svensson, Ron Trompert
  • Russian ROC: Lev Shamardin
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Kai Neuffer
  • UK/Ireland ROC: Derek Ross
  • GGUS: Torsten Antoni
  • GOCDB: Gilles Mathieu

WLCG

  • WLCG Service Coordination: Absent

WLCG Tier 1 Sites

  • ASGC: Absent
  • BNL: Absent
  • CERN site: Ignacio Reguero
  • FNAL: Joe Kaiser
  • FZK: Angela Poschlad
  • IN2P3: ?
  • INFN: Absent
  • NDGF: Leif
  • PIC: Kai Neuffer
  • RAL: Derek Ross
  • SARA/NIKHEF: Ron Trompert
  • TRIUMF: Absent

LHC Experiments

  • ATLAS: Alessandro di Girolamo
  • LHCb: Roberto Santinelli
  • CMS: Absent
  • ALICE: Absent

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

       Primary Team   Secondary Team
From   ROC SEE        ROC SWE
To     ROC UKI        ROC CE

  • The ROCs on duty noticed the effects of the BDII v3.0 incompatibilities with the latest gLite release. They also saw unexpected APEL-pub alarms after the APEL-pub tests were put into production by SAM (who hadn't followed the correct procedure).

  • It was deemed that UKI-LT2-UCL-CENTRAL, who are in downtime for three months, should really be suspended.

PPS Reports

Antonio reported a delay in the deployment of the CREAM CE due to a proxy renewal mechanism issue discovered in certification (the range of ports required is excessive). A pilot CREAM checkpoint meeting took place, the minutes of which are attached to the agenda. CREAM CEs will not be seen by default in production.

Many WMS are still running with version 3.0. Steve produced a list of old WMSes (Action item 145) to identify those which should upgrade. Roberto: what is preventing rolling out WMS 3.1? Steve: Nothing. Roberto: Are there any 3.0s around? Steve: Don’t think so, they’re all publishing 3.1...

EGEE Items From ROC Reports

Steve said that there was nothing from the ROC reports that hadn’t been solved.
  • Responses (so far, all fairly negative) to the proposal about central software distribution are being collated (by whom?) and forwarded to SA3. Nick reported that, from the WLCG point of view, they were happy if this was to be WLCG-specific and only used by the experiments.
  • The latest, urgent gLite release, which introduced GFAL and lcg-utils software that was incompatible with BDII v3.0, caused a lot of problems for the production infrastructure. ROC France reported that the lcg-utils incompatibility with the gLite 3.0 Top BDII showed a weakness within the test and certification process. ROC DECH thought that there should be at least 10 days' notice of high-priority updates, but it was pointed out that this was contrary to the notion of "high priority".
  • There were several SAM issues, which John explained as follows:
    • (CE ROC) The 10 s timeout used in some SAM tests is for the network connection only. SAM reckons that if a BDII doesn't respond within 10 seconds, there's little point waiting another 50 seconds (the previous timeout was 60 s). This was seconded by Steve, but ROCs were asked to open GGUS tickets if they find evidence that the setting is too low.
    • SAM and GridView had not updated to using the new GOCDB connection, so were pointing to an inert version of the database. This explains why newly certified nodes, and nodes coming out of downtime, were not being picked up by SAM.
    • The APEL-pub test was made critical without warning the COD, so both on-duty ROCs flagged this as a problem.
    • In addition, there were (very) intermittent BDII problems at CERN that caused irregular test failures, and an error during the installation of the new CA version tests (new config file not copied) on the SAM UI caused those tests to return no output.
  • SWE ROC reported accounting publishing woes during the weekend, but Derek was not aware of any central APEL problems.

Angela reported that DESY Hamburg were unhappy at having to install ATLAS RPMs. No details were available (to be sent to Steve once they are). Alessandro explained that sites who commit to support ATLAS must support their installation tool. It seems that the objection was more to the lack of diplomacy used to request the upgrade than to the s/w itself. Angela will talk to local ATLAS contacts. Kai had the same problem at PIC (request to install Blah on WNs); he will send the request to Steve.

Kai noted that tickets from ATLAS were going directly to sites as opposed to the s/w manager for ATLAS.

Steve reminded ROCs that BDII v3.1 sites should be in this list, and asked them to check that this was indeed the case.

WLCG Service Coordination

There was no WLCG coordinator at the meeting, as neither Jamie nor Harry was present.

WLCG issues coming from ROC reports

None were noted.

Upcoming WLCG Service Interventions

Steve hadn't checked, and suggested that people could do this for themselves anyway from the links on the agenda page.

ATLAS Service

  • Alessandro reported that some downtimes were not being broadcast, and asked that GOCDB tick the checkbox by default. Gilles confirmed that this had been done.
  • Alessandro jokingly claimed that their downtime calendar, which is built from the CIC RSS feed, is better than the LHCb one, and that it uses better colours to depict the Tiers. He suggested that it would be nice to have it implemented directly in the CIC portal, and both Cyril and Gilles seemed to be in agreement.
  • The problem with the BNL team ticket in GGUS was solved. BNL hadn’t updated the relevant Twiki page. The full test of the team ticket has not yet been completed, but will be done. Alessandro stressed that this was urgent, because "data-taking starts the day after tomorrow".
  • ATLAS have developed their own SRMv2 tests, which have been running for 10 days, to check space-tokens. Alessandro asked that sites check the ATLAS-srmv2 test results in the SAM portal to ensure that all space-tokens are working. The tests will be set to critical in a couple of weeks.

ALICE Service

CMS Service

LHCb Service

  • Roberto reported that a module for installing s/w was buggy, so LHCb had stopped submitting LHCb-specific CE tests to "avoid screwing up the shared area". Thus, there are no CE results for the past two weeks.

Following the reports, Rolf (French ROC) suggested that e-mails be sent to the CIC developers with concrete examples of downtimes "that you weren’t alerted on". Both Alessandro and Roberto said that there were examples of GOCDB downtimes not being sent through the CIC portal, but that it might be difficult to track them down.

Roberto, addressing the GGUS developers, said that Team Tickets, conceptually, are not high-priority, but are rather ones that have to be shared. It is not necessary, therefore, that the default priority be "high".

OSG Items

Rob had no outstanding tickets in his system, but Maria had two complaints:

  • Escalation report GGUS:37059 (closed but reopened);
  • GGUS:39303, which she updated to “waiting for reply”.

Rob described the continuing messaging outage (OSG send their test results and downtime information to GridView using the Message Bus), and expressed "concern on the OSG side". John confirmed that the brokers had been crashing recently, and that James was investigating. The good news is that no messages were lost - only delayed!

Action Items

Newly Created Action Items

Assigned to Due date Description State Closed Notify  

Review of Open Action Items

Open Action Items

Id  Submitter  Description  Creation  Due  Assigned To

Actions Closed in Last 20 Days

Id  Submitter  Description  Creation  Due  Assigned To  Closed

Next Meeting

The next meeting will be Monday, 15 Sep 2008 15:00 UTC (16:00 Swiss mountain time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r3 - 2008-09-12 - JohnShade