WLCG-OSG-EGEE Ops' Minutes Mon 31 Aug 2009
Summary
Although MPI deployment problems exist with the gLite SL5 WN, the SAM MPI tests can
be enabled, since the site's published information system is interrogated first to check for MPI support
at a given site.
Attendance
EGEE
- Asia Pacific ROC: Jason Shih
- Central Europe ROC: Małgorzata Krakowian
- OCC / CERN ROC: John Shade, Antonio Retico, Steve Traylen, Maite
- French ROC:
- German/Swiss ROC:
- Italian ROC:
- Northern Europe ROC:
- Russian ROC:
- South East Europe ROC: Marios Chatziangelou
- South West Europe ROC: Christian Neissner
- UK/Ireland ROC:
- GGUS: Helmut
- GOCDB:
Feedback on Last Week's Minutes
None was given.
EGEE Items
Grid Operator Hand Over on Duty
c-COD Team |
---|
From | Northern |
To | Italy |
- The problems with Asia Pacific have now been resolved.
- Two sites (NE and AP) have overdue alarms, but I have informed both of them about this.
- No issues to report to the WLCG meeting, except to inform them that the AP problems are now resolved.
PPS Reports and Issues
- Please find Issues from EGEE ROCs and general info in: https://twiki.cern.ch/twiki/bin/view/LCG/OpsMeetingPps
- Site reports are now no longer given.
- New releases of BDII, ICE, CREAM and FTS, with a lot of changes.
- FTS is now on SL5, with many new features and some obsoleted ones. See the page above.
- CREAM is at version 1.5, with support for LSF in BLAH.
- APEL records now work where the Torque pbs_server is on a separate host.
- Empty changes of information providers.
- dCache - Security update and bug fixes.
- New YAIM core and clients. A number of new variables have been added.
- SWAT renaming of GCM (the old SAM wrapper tests).
gLite Release News
EGEE Items From ROC Reports
Italy, France and UKI had not validated their ROC reports as of the 14:00 deadline.
Reports show no major operational issues encountered during the reporting period, and no points to raise at this meeting.
- FZK-LCG2 wishes to convey the following INFO: Planned downtime at FZK-LCG2 on 10-09-2009, 07:00 - 08:00 UTC. The LFC service lfc-fzk.gridka.de (not the LHCb LFC) will be down while it is split into an ATLAS one (atlas-lfc-fzk.gridka.de) and a non-ATLAS one (lfc-fzk.gridka.de, as before).
- SEE ROC: At the previous operations meeting the following issue was briefly discussed: “WLCG MB agreed on 4th of August to ask for the SL5 migration at all Sites, including the Tier-2 Sites.” As far as we know, MPI is still not supported by glite-3.2 (see GGUS:47422). We understand that this affects only the WLCG sites (at the moment), but since many users/teams in our region depend on the MPI facility/capability of the Grid, we think that this issue could be given higher priority by the developers.
- SWE ROC: We'd like to certify a site that runs only central services (WMS, LFC, etc.); the site has no storage or computing backend. Is this possible from the point of view of OPS? From the meeting: Please go ahead and create the site. Update the action item. Please report any problems. This is certainly a valid configuration.
- Problems with SL5 nodes: is there a page somewhere with notes on SL5 for WLCG support? See SL4toSL5wnMigration and SL5DependencyRPM.
Grid Service Interventions
- Consult links on the agenda page.
Misc Items.
SAM default DPM upgrade
Last reminder that the default DPM used for SAM tests will be upgraded to SL4 next Monday, 7th of September, and that sites with obsolete client S/W will start failing tests.
- SAM MPI tests will NOT be activated. There are pending tickets for SL5.
- Following discussion in the meeting: the MPI tests check what is published, so sites with ill-working MPI will not fail the MPI tests so long as they do not publish that they support MPI, which is perfectly correct.
- Notification of new gstat beta version (see attached material)
- 7 sites are running legacy gLite releases; those not upgraded next week will be moved to suspended/uncertified until they are:
Site | Host | Version |
---|
EENet | kriit.eenet.ee | 3.0.2 |
HK-HKU-CC-01 | ce.grid.hku.hk | 3.0.2 |
JP-KEK-CRC-01 | dg10.cc.kek.jp | 3.0.2 |
Taiwan-IPAS-LCG2 | atlasce.phys.sinica.edu.tw | 3.0.2 |
Taiwan-NCUCC-LCG2 | ce.cc.ncu.edu.tw | 3.0.2 |
TW-NTCU-HPC-01 | host001.hpc.ntcu.edu.tw | 3.0.2 |
UKI-LT2-RHUL | ce1.pp.rhul.ac.uk | 3.0.2 |
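The MPI test behaviour discussed above (check what the site publishes first, and only then run the test) can be sketched roughly as follows. This is an illustrative sketch, not the SAM test code: the tag names in `MPI_TAGS`, the function names and the "SKIPPED" outcome label are all assumptions made for the example.

```python
# Tags a site might publish to advertise MPI support.
# These names are assumptions for illustration, not taken from the SAM tests.
MPI_TAGS = {"MPI-START", "MPICH", "MPICH2", "OPENMPI"}


def publishes_mpi(runtime_env_tags):
    """Return True if any published tag advertises MPI support."""
    for tag in runtime_env_tags:
        name = tag.strip().upper()
        # Accept a bare flavour ("OPENMPI") or a versioned one ("OPENMPI-1.3").
        if any(name == known or name.startswith(known + "-") for known in MPI_TAGS):
            return True
    return False


def mpi_test_outcome(runtime_env_tags, run_mpi_job):
    """Run the MPI job only for sites that publish MPI support.

    run_mpi_job is a hypothetical callable returning True on success.
    """
    if not publishes_mpi(runtime_env_tags):
        # Nothing published: the site is not tested, so it cannot fail.
        return "SKIPPED"
    return "OK" if run_mpi_job() else "FAILED"
```

With this logic a site whose MPI setup is broken only fails if it also publishes MPI support, which matches the conclusion reached in the meeting.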
OAT Items
GStat
GStat 2.0 Beta Release
The Beta release of GStat is now available. Installation and configuration instructions are available.
http://goc.grid.sinica.edu.tw/gocwiki/GSInstallationGuide
For any questions or comments, please email the GStat support list:
project-grid-info-support@cernNOSPAMPLEASE.ch
Nagios
Update to the EGEE SA1 OAT release
An update to the EGEE SA1 OAT release has now been released and is available in the usual repositories.
There are no changes to the YAIM configuration required, but it is necessary to at least rerun ncg.pl, e.g. via a YAIM rerun, following the "yum update" of your packages.
Changes include:
- Changes to grid-monitoring-probes-org.bdii probes with NCG providing configuration for them. Probe details: http://goc.grid.sinica.edu.tw/gocwiki/NagiosProbe
- Addition of org.gstat.CE and org.gstat.SE probes. These provide the sanity checks similar to those the gstat1 web interface provided. These are the gstat2 probes. In particular these look for greater compliance to the WLCG/EGEE glue schema usage documents.
- Nagios probe results that are collected via the messaging system now have their status prefixed with the hostname from where the test was executed. E.g. for a ROC that submitted a WN test to a site via a CE, the probe result, once transmitted to the site Nagios via the msg service, will appear as before as service "org.sam.WN-Bi-dteam-roc" on the CE node, but the status line contains the WN name, e.g. lxbra3908.cern.ch: OK: getCE: ce103.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam, indicating that lxbra3908 was the WN where the test was executed.
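For anyone post-processing these results, the new status line format can be split into its parts along the lines below. This is a minimal sketch that assumes the layout is always `<executing host>: <STATUS>: <details>`, as in the example above; the function name and the returned keys are hypothetical.

```python
# Standard Nagios service states.
NAGIOS_STATES = {"OK", "WARNING", "CRITICAL", "UNKNOWN"}


def parse_probe_status(status_line):
    """Split '<executing host>: <STATUS>: <details>' into its three parts."""
    # Split on the first two ": " separators only, so that colons inside
    # the details (e.g. "ce103.cern.ch:2119") are left untouched.
    parts = status_line.split(": ", 2)
    if len(parts) != 3 or parts[1] not in NAGIOS_STATES:
        raise ValueError("status line is not host-prefixed: %r" % status_line)
    host, status, details = parts
    return {"executed_on": host, "status": status, "details": details}


example = ("lxbra3908.cern.ch: OK: getCE: "
           "ce103.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam")
result = parse_probe_status(example)
print(result["executed_on"])
```

Lines that do not carry the hostname prefix raise a `ValueError` rather than being guessed at, since older results may still use the previous format.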
Bug Fixes:
Install instructions via YAIM:
GridMonitoringNcgYaim
Bug Reports
https://savannah.cern.ch/projects/sa1tools/
Discussion mailing list, including pre-release announcements: join
egee3-operations-automation-discuss@cernNOSPAMPLEASE.ch via
https://groups.cern.ch
Description of yum repositories, including pretty repoview HTML pages and RSS feeds of package updates:
EGEESA1PackageRepository
Known Problems: We plan to shortly deploy a fix to the production message brokers for a bug that at times can cause consumers to fail to get messages.
The OSG supporter wrote in the diary of GGUS:49970 that the problem is solved, hence the ticket will be closed.
However, the corresponding OIM ticket 7148 is in Status: Support Agency. Therefore the GGUS ticket cannot be closed.
Please adapt the ticket status and put a comprehensive text in the Solution field for the GGUS Knowledge Data Base.
Comments from the meeting suggest that everything is solved.
Newly Created Action Items
Review of Open Action Items
Both covered in the meeting.
Open Action Items
Id | Submitter | Description | Creation | Due | Assigned To | |
---|
Actions Closed in Last 20 Days
Id | Submitter | Description | Creation | Due | Assigned To | Closed | |
---|
AOB
Next Meeting
The next meeting will be Monday, dd mmm 2009 14:00 UTC (16:00 Swiss local time).
- Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
- The meeting will start promptly at 14:00 UTC (16:00 Swiss local time).
- To dial in to the conference:
- Dial +41227676000
- Enter access code 0148141