OCC / CERN ROC: Antonio Retico, Nick Thackray, Maite Barroso
French ROC: Helene Cordier, Rolf Rumler
German/Swiss ROC:
IGALC: Ramon Diacovo
Italian ROC: Paolo Veronesi
Latin American ROC:
Northern Europe ROC:
Russian ROC:
South East Europe ROC:
South West Europe ROC: Christian Neissner
UK/Ireland ROC: Jeremy Coles
GGUS:
GOCDB:
WLCG Tier 1 Sites
ASGC: Absent
BNL: Absent
CERN site: Absent
FNAL: Absent
FZK:
IN2P3:
INFN:
NDGF:
PIC:
RAL:
SARA/NIKHEF: Absent
TRIUMF: Di Qing
Feedback on Last Week's Minutes
None was given.
EGEE Items
Grid Operator Hand Over on Duty
c-COD Team
From: ROC NE
To: ROC Italy
Report from cCOD:
Handover Log: Although the dashboard looks full, the situation is as follows:
Many AP sites haven't updated their alarms (in OK status) or tickets (expired on the 12th) since the weekend. Some of the sites are in downtime, so those alarms/tickets are currently ignored. Lastly, and this is one for the WLCG meeting: the APEL situation does not appear to have been finally resolved:
"APEL Publication works normally and records are properly received, yet no update can currently be reflected in SAM or in the accounting portal. Note to operators: please ignore alarms on the APEL-Pub test until further notice."
Therefore, I advise a moratorium on these tickets/alarms until APEL tells us all is OK again.
Cheers, Vera Hansper, NE ROC/NDGF C-COD
There were no questions.
The release to production had to be delayed by one day with respect to the original plan, due to a deployment issue affecting the dCache server and the analysis of an issue affecting the UI at one of the early adopter sites. The dCache server has been removed from the release, and a new round of testing is now ongoing to validate that the clients interact correctly with the existing service.
No further questions.
EGEE Items From ROC Reports
No major issues were raised by any ROCs this week. 5 of 14 ROCs had not submitted their report by 3:30.
Nothing raised at the meeting.
Fixing MPI sites (from the MPI WG)
The SAM MPI tests have been raising alarms since this morning, as agreed last week. An update received last week from Isabel Campos (MPI Task Force) is summarised here. The current situation is as follows:
There are 90 sites which publish the MPI-START tag: 88 are tested by SAM, and 2 other sites (IFCA and RAL) are not tested because of the way they publish the SubCluster info. Of those sites: 69 are working fine (67 at SAM + IFCA + RAL); 20 have errors; 2 are in maintenance.
This gives 76% of sites passing the tests (75% if we don't count the sites out of SAM).
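The percentages can be rechecked from the counts quoted above. The following is a minimal Python sketch that uses only those counts; the variable and function names are illustrative and not part of any EGEE or SAM tool. As reported, the three status categories add up to 91 rather than 90, and the SAM-only rate works out to roughly 76%, so the quoted percentages are best read as approximate.

```python
# Counts as reported in the MPI Task Force update (taken from the minutes).
published = 90        # sites publishing the MPI-START tag
tested_by_sam = 88    # of those, sites tested by SAM
working_total = 69    # 67 at SAM + IFCA + RAL
working_in_sam = 67   # working sites among those tested by SAM
errors = 20           # sites with errors
maintenance = 2       # sites in maintenance


def pass_rate(passing, total):
    """Return the share of passing sites as a percentage."""
    return 100.0 * passing / total


rate_all = pass_rate(working_total, published)       # all publishing sites
rate_sam = pass_rate(working_in_sam, tested_by_sam)  # SAM-tested sites only
category_sum = working_total + errors + maintenance  # should match 'published'

print(f"All sites: {rate_all:.1f}%")
print(f"SAM only:  {rate_sam:.1f}%")
print(f"Sum of categories: {category_sum} (reported total: {published})")
```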
For all the sites with errors, a ticket has been opened in GGUS, and most of them are actively working on finding a solution. There is a guide listing the errors found and possible solutions for them at http://wiki.ifca.es/e-ciencia/index.php/MPI_Errors
Documentation for MPI Support in EGEE:
https://twiki.cern.ch/twiki/bin/view/EGEE/MpiTools
More information about errors in the MPI knowledge DB:
http://wiki.ifca.es/e-ciencia/index.php/MPI_Errors
From today, the MPI tests have been made critical for raising alarms in the dashboard (not for the availability/reliability calculations).
APEL status update
APEL team: There was a large DB crash, and recovery was not straightforward. Recovery is in progress. Sites will need to be asked to republish for some period in January, but the exact period cannot be given yet. The system is not yet up and running. The warnings will not be fixed until the central DB is running correctly. Estimated date for completion: some time next week.
Maite: Concerned that recovery is taking more than 2 weeks for such a key central operations tool as APEL. Also, the communications could have been clearer: there was no update to EGEE last week.
APEL team: Lessons have already been learned and will be implemented. However, it is not clear to us which channels are the correct ones to use for communications.
Maite: Frequent broadcasts would be fine.