WLCG-OSG-EGEE Ops' Minutes Mon 01 Mar 2010

Summary

Incident affecting Apel central services completely recovered.
Tuesday 2 March is the last day for sites to upgrade CAs before alarms are raised.
FTS version 2.2.3 (SL4) released to production.

Attendance

EGEE

  • Asia Pacific ROC: ShuTing Liao
  • Canadian ROC: Di Qing
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: Antonio Retico, Nick Thackray
  • French ROC: Pierre Girard, Osman Aidel
  • German/Swiss ROC: Angela Poschlad
  • Italian ROC: Alessandro Paolini
  • Latin American ROC: Renato Santana
  • ROC IGALC: Ramon Diacovo
  • Northern Europe ROC: Ron Trompert
  • Russian ROC: Lev Shamardin, Victor Edneral
  • South East Europe ROC: Marios
  • South West Europe ROC: Christian Neissner, Gonzalo Merino
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Torsten Antoni, Helmut Dres
  • GOCDB:

WLCG Tier 1 Sites

  • ASGC: ShuTing Liao
  • BNL: Absent
  • CERN site: Harry Renshall
  • FNAL: Joe Kaiser
  • FZK: Angela Poschlad
  • IN2P3: Pierre Girard
  • INFN: Alessandro
  • NDGF: Leif
  • PIC: Gonzalo
  • RAL: Gareth Smith
  • SARA/NIKHEF: Absent
  • TRIUMF: Absent

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

  c-COD Team
From ROC France
To ROC Central Europe

  • Report from cCOD:

See report attached to the Agenda.

Apel failures

Comments: David (SWE): For the sites in SWE the issues are local and highly technical, and have therefore been escalated to developer support. A timeline for the fix is not yet available. Nick: should the time to a solution grow longer than one week, it is advisable to put the sites in maintenance in order to reduce noise.

MPI failures

Cyril commented on the failures observed in the MPI tests. He had not been looking at the results directly, so he could not say whether there are trends in the failures.

James noted that, in discussions held in the context of the SAM --> Nagios transition, an evaluation was made and the number of sites failing the MPI tests now seems to be down to around five or six. It was noted, however, that Nagios is not yet ready to test MPI. A timeline for this feature has been set at one week from now, which makes Isabel happy.

Sites Considered For Suspension

Pilot Services Reports and Issues

gLite Release News

gLite 3.1 Update 61 introduces, among others, the long-awaited FTS version 2.2.3. Also of particular interest for the sites is the new version of the host certificate of lcg-voms.cern.ch.

A new bundle of patches on the 3.1 baseline is currently being processed. Tasks for the early-adopter sites will be opened tomorrow.

EGEE Items From ROC Reports

  • ROC SWE: Lots of open tickets related to APEL. The affected sites are mostly small and quite unresponsive. The ROC will provide a document with a timeline to solve the problems of those sites.
    The ROC will undertake appropriate action.

  • ROC SWE: On behalf of Mario David, the discussion on HEPSPEC06 was raised once more:
    1. As a question of principle, can EGEE/EGI force sites to install non-free software?
    Nick: EGEE perhaps cannot; EGI may want to. This question should eventually be brought up at the EGI transition meeting.
    2. Related to this, upcoming sites might want tables of published HEPSPEC06 values for specific hardware, in order to use them as a reference instead of running the benchmark on their computing back-end.
    Nick: This goes against the very principle of benchmarking, and such tables have in the past proved to contain largely incorrect and sometimes biased values. The fear was expressed that the imposition of non-free software may set a precedent in the infrastructure. This is understandable, but I think that it would be counter-productive for EGI to go along this line. Furthermore, this risk is too theoretical to be discussed in this session. It would be like inferring that having a police force necessarily causes a country to evolve into a police state.
    Renato Santana (ROC_LA) recalled that during one of the recent SA1 coordination meetings it was decided to set a deadline for the sites to run the benchmark. He asked for this deadline to be clarified, as it is difficult for the sites to go through the necessary bureaucracy. Nick will investigate and reply.

Migration from SAM to Nagios

Link
https://cic.gridops.org/index.php?section=roc&page=broadcast_archive&step=2&typeb=C&idbroadcast=45487

James: the test results were evaluated by the ROD team. Issues were raised; some were new (e.g. the need for MPI testing), others corresponded to already tracked bugs. However, the overall feedback was positive, apart from some normal fluctuations in the result comparison, and there were no showstoppers. So this morning Cyril flipped the big switch at around 11 AM, and now the alarms on the dashboard are generated by Nagios tests. Availability records will still be calculated with SAM tests for the month of March, and the resulting figures compared to those generated by Nagios. If they look compatible, the report will be fed by Nagios as early as April. Otherwise the exercise will be repeated.

Tickets for sites are being opened by Nagios and associated with the category 'Nagios' in GGUS.

In summary, we now have:

  • Operations (Nagios)
  • Availability (SAM)

Apel status update

The incident affecting Apel central services seems to be completely recovered.

The information about the status in http://goc.grid.sinica.edu.tw/gocwiki/ApelIssues-Jan_Feb_2010 will be kept up-to-date more regularly

CA update

Tomorrow is the last day for sites to perform the CA upgrade before they start receiving alarms

Grid Service Interventions

  • Consult links on the agenda page.

Miscellaneous

none

Newly Created Action Items

Assigned to | Due date | Description | State | Closed | Notify

Review of Open Action Items

Open Action Items

Id | Submitter | Description | Creation | Due | Assigned To

Actions Closed in Last 20 Days

Id | Submitter | Description | Creation | Due | Assigned To | Closed

AOB

none

Next Meeting

The next meeting will be Monday, 8 Mar 2010 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0148141


Topic revision: r1 - 2010-03-01 - unknown
 