WLCG-OSG-EGEE Ops' Minutes Wed 17 Feb 2010

Summary

No summary yet.

Attendance

EGEE

  • Asia Pacific ROC:
  • Canadian ROC: Di Qing
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: Antonio Retico, Nick Thackray, Maite Barroso
  • French ROC: Helene Cordier, Rolf Rumler
  • German/Swiss ROC:
  • IGALC: Ramon Diacovo
  • Italian ROC: Paolo Veronesi
  • Latin American ROC:
  • Northern Europe ROC:
  • Russian ROC:
  • South East Europe ROC:
  • South West Europe ROC: Christian Neissner
  • UK/Ireland ROC: Jeremy Coles
  • GGUS:
  • GOCDB:

WLCG Tier 1 Sites

  • ASGC: Absent
  • BNL: Absent
  • CERN site: Absent
  • FNAL: Absent
  • FZK:
  • IN2P3:
  • INFN:
  • NDGF:
  • PIC:
  • RAL:
  • SARA/NIKHEF: Absent
  • TRIUMF: Di Qing

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

  c-COD Team
From ROC NE
To ROC Italy

  • Report from cCOD:
Handover Log: Although the dashboard looks full, the situation is as follows.
Many AP sites have not updated their alarms (in OK status) or tickets (expired on 12th) since the weekend. Some of the sites are in downtime, so those alarms/tickets are currently ignored. Lastly, and this is one for the WLCG meeting: the APEL situation does not appear to have been finally resolved:
"APEL Publication works normally and records are properly received, yet no update can currently be reflected in SAM or in the accounting portal. Note to operators: please ignore alarms on the APEL-Pub test until further notice."
I therefore advise a moratorium on these tickets/alarms until APEL tells us all is OK again.
Cheers, Vera Hansper, NE ROC/NDGF C-COD

There were no questions.

Sites Considered For Suspension
None.

Pilot Services Reports and Issues

Nothing to report.

gLite Release News

  • Staged rollout of several updates.
  • The release to production was delayed by one day relative to the original plan, for two reasons: a deployment issue affecting the dCache server, and the analysis of an issue affecting the UI at one of the early adopter sites. The dCache server has been removed from the release, and a new round of testing is under way to validate that the clients interact correctly with the existing service.

No further questions.

EGEE Items From ROC Reports

  • No major issues were raised by any ROC this week. 5 of the 14 ROCs had not submitted their report by 3:30.
  • Nothing raised at the meeting.

Fixing MPI sites (from the MPI WG)

The SAM MPI tests have been raising alarms since this morning, as agreed last week.

The update received last week from Isabel Campos (MPI Task Force) is summarised here. The current situation is the following:
There are 90 sites which publish the MPI-START tag: 88 are tested by SAM, and 2 other sites (IFCA and RAL) are not tested because of the way they publish the SubCluster info. Of those sites, 69 are working fine (67 in SAM, plus IFCA and RAL), 20 show errors, and 2 are in maintenance.
This gives 76% of sites passing the tests (75% if we don't count the sites outside SAM).
A GGUS ticket has been opened for every site with errors, and most of them are actively working on a solution. Further information:
  • Guide listing the errors found and possible solutions (MPI knowledge DB): http://wiki.ifca.es/e-ciencia/index.php/MPI_Errors
  • Documentation for MPI Support in EGEE: https://twiki.cern.ch/twiki/bin/view/EGEE/MpiTools
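The pass rates quoted in the update can be checked with a short calculation. The sketch below is illustrative only and is not part of the Task Force report. It assumes the denominator is the sum of the listed categories (working, errors, maintenance) and that all failing and maintenance sites are SAM-tested. Under those assumptions the totals come to 91 and 89 sites rather than the stated 90 and 88. The minutes do not explain the difference, but these denominators reproduce the quoted 76% and 75%.

```python
# Counts as listed in the MPI Task Force update.
working_in_sam = 67       # sites passing the SAM MPI tests
working_outside_sam = 2   # IFCA and RAL, not tested by SAM
failing = 20              # sites with errors
in_maintenance = 2        # sites in maintenance


def pass_rate(passing: int, total: int) -> int:
    """Return the pass rate as a whole-number percentage."""
    return round(100 * passing / total)


# Assumption: the denominator is the sum of the listed categories,
# and every failing or maintenance site is tested by SAM.
all_sites = working_in_sam + working_outside_sam + failing + in_maintenance
sam_sites = all_sites - working_outside_sam

rate_all = pass_rate(working_in_sam + working_outside_sam, all_sites)
rate_sam_only = pass_rate(working_in_sam, sam_sites)

print(f"All sites: {rate_all}% of {all_sites}")
print(f"SAM-tested only: {rate_sam_only}% of {sam_sites}")
```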

From today, the MPI tests have been made critical for raising alarms in the dashboard (not for the availability/reliability calculations).

APEL status update

  • APEL team: We had a large DB crash and recovery was not straightforward. Recovery is still in progress. Sites will need to be asked to republish for some period in January, but the exact period cannot be given yet. The system is not yet up and running. The warnings will not be fixed until the central DB is running correctly. Estimated date for completion: some time next week.
    Maite: Concerned that recovery is taking more than 2 weeks for such a key central operations tool as APEL. Also, the communications could have been clearer: there was no update to EGEE last week.
    APEL team: Lessons have already been learned and will be implemented. But it is not clear to us which channels are the correct ones to use for communications.
    Maite: Frequent broadcasts would be fine.

Newly Created Action Items

Assigned to Due date Description State Closed Notify  

Review of Open Action Items

Open Action Items

Id  Submitter  Description  Creation  Due  Assigned To

Actions Closed in Last 20 Days

Id  Submitter  Description  Creation  Due  Assigned To  Closed

AOB

Nothing raised.

Next Meeting

The next meeting will be Monday, 22 February 2010 16:00 UTC+1

  • Attendees can join from 15:45 UTC+1 onwards.
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0148141


Topic revision: r1 - 2010-02-17 - unknown
 