WLCG-OSG-EGEE Ops Minutes, Mon 17 Dec 2007
Attendance
EGEE
- Asia Pacific ROC: Min
- Central Europe ROC: Marrten
- OCC / CERN ROC: John Shade, Nick Thackray, Steve Traylen
- French ROC: Gilles, Davide, Rolf, Osman, Cyril, Piere, Helen
- German/Swiss ROC: Sven
- Italian ROC: Absent
- Northern Europe ROC: Jules
- Russian ROC: Lev
- South East Europe ROC: Kostas
- South West Europe ROC: Kai
- UK/Ireland ROC: Jeremy
- GGUS: Thorsten
- OSCT: Absent
WLCG
- WLCG Service Coordination: Absent
WLCG Tier 1 Sites
- ASGC: Min
- BNL: Absent
- CERN site: Absent
- FNAL: Joe
- FZK: Sven
- IN2P3: Piere
- INFN: Absent
- NDGF: Someone
- PIC: Kai
- RAL: Derek and Matt
- SARA/NIKHEF: Jules
- TRIUMF: Rod Walker
Reports Not Received
- WLCG Tier 1s:
- VOs:
- EGEE ROCs (Prod Sites):
- EGEE ROCs (PPS Sites):
Feedback on Last Week's Minutes
None were given.
EGEE Items
Copy In from Minutes
Grid Operator on Duty Handover
| | Primary Team | Secondary Team |
| From | France | Central Europe |
| To | CERN | Taiwan |
COD duty is becoming heavier and heavier because of the synchronization problem between GOCDB and SAM. Tickets GGUS:30046 and GGUS:30306 concern this problem. Why does SAM need a 3-day retention period (which in practice becomes 2 weeks!) before updating the SAM DB when a node is removed from GOCDB and the information system? Protecting against a GOCDB failure is not a good enough reason. Moreover, since the last GOCDB update, some old nodes that had monitoring switched off in the previous version are now monitored. This was announced neither to site admins nor to COD. Site admins must now delete from GOCDB all old nodes that are no longer used, and these nodes must also be removed from the information system so that SAM does not monitor them. Should we open tickets against sites about such nodes?
The three-day retention period arises because SAM gathers its list of service nodes from both GOCDB and the production BDII. If a service node then disappears from these sources, it remains in the SAM DB and continues to be tested for three days.
A summary of the GGUS tickets was given: a node was deleted from the BDII and GOCDB, but it then took two weeks for it to be deleted from Gridview, as described in the ticket.
Lev, please see GGUS:22573 as well, which may be related.
Judit suggested flagging the site as in downtime, switching monitoring off, and then removing the node. However, it is apparently impossible to switch monitoring off even during a scheduled downtime. This will be followed up with an action item.
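The retention behaviour discussed above can be illustrated with a minimal sketch. It is not SAM's actual implementation: the function names, the in-memory `last_seen` dictionary, and the host names are hypothetical, and only the three-day window and the two sources (GOCDB and the production BDII) come from the discussion.

```python
from datetime import datetime, timedelta

# Assumption: a three-day retention window, as stated in the minutes.
RETENTION = timedelta(days=3)


def record_sighting(last_seen, gocdb_nodes, bdii_nodes, now):
    """Refresh the last-seen time of every node listed in either source."""
    for node in set(gocdb_nodes) | set(bdii_nodes):
        last_seen[node] = now


def nodes_to_test(last_seen, now, retention=RETENTION):
    """Return the nodes seen within the retention window, sorted by name."""
    return sorted(
        node for node, seen in last_seen.items() if now - seen <= retention
    )


if __name__ == "__main__":
    last_seen = {}
    day0 = datetime(2007, 12, 1)

    # Hypothetical hosts: both are published on day 0.
    record_sighting(last_seen, ["ce01", "se01"], ["ce01"], day0)

    # se01 is removed from both sources; two days later it is still tested.
    day2 = day0 + timedelta(days=2)
    record_sighting(last_seen, ["ce01"], ["ce01"], day2)
    print(nodes_to_test(last_seen, day2))

    # After four days se01 has aged out of the retention window.
    day4 = day0 + timedelta(days=4)
    record_sighting(last_seen, ["ce01"], [], day4)
    print(nodes_to_test(last_seen, day4))
```

Under this model a removed node should stop being tested once the window expires, so a two-week delay such as the one reported for Gridview would point to a problem elsewhere in the synchronization chain.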
PPS Reports
- One site asked about setting up WNs in both the PPS and production environments, e.g. what to do about LCG_GFAL_INFOSYS. A recipe covering at least what should be considered in this situation would be helpful. Nick thinks a recipe may already be in place. Nick also commented that this may be superseded by removing the need for production WNs in PPS anyway.
EGEE Items From ROC Reports
- ROC Central Europe
- New GGUS ticket with no response (from Gridview Team) for 2 weeks: GGUS:30025
- There has been a response.
- France
- Downtime not taken into account by Gridview (GGUS #30042): Gridview says it comes from a bug in the GOCDB synchronization script. Is it fixed? Regarding SAM removing tests in Gridview (GGUS #30044): no answer from Gridview.
- Gridview say it is fixed.
- France
- BDII: With gLite 3.0, it seems that the GRIS is now implemented with a BDII (instead of globus-mds) for the LCG-CE node. In that case, is it still possible to combine the LCG-CE and site BDII on the same machine? If yes, how should this combined node be configured with YAIM?
- It is possible; there is a recipe in the install guide, and a pointer will be added to the minutes. Lots of sites have this configuration, so it does basically work. If there is a particular problem, raise a GGUS ticket so that it can be investigated.
- ROC North Europe
- Because of a security incident, several certificates issued by the Dutch CA had to be revoked. However, it was noticed that some services still accepted the revoked certificates a day later. Services still accepting revoked certificates:
- CIC portal
- GOC database
- Web portal of the VOMS server at SARA (this is under investigation by SARA)
- SAM Admin page
- Steve will create a generic howto for using EGEE certificates and CRLs with an Apache web server. Action created.
- Russia
- LHCb VOMS groups: Russian sites have been asked to update VO mappings again
- They have been unable to find a history of being asked to change the VO mappings for LHCb. How do we find the definitive current information?
- Answer from LHCb: LHCb gave a history of why they gave up asking sites to use their preferred customization of VO mappings.
- They have updated the VO cards, which are the definitive reference. Any changes will be, and have been, announced at this meeting.
- Is there an archive of CIC broadcasts going back more than one month? Helen: there is a longer history.
- Russia: if you cannot find the history, please submit a GGUS ticket.
- Helen surmised that this feature was only added in December, so a sensible archive exists only from that time; anything older would have to be retrieved from other sources.
- UK/I
- Timeouts need reviewing
- This has been passed on to the SAM team, who are reviewing it. It was commented that any work should consider not just what the timeout value is but also whether a timeout should count as a failure. SAM would like a specific GGUS ticket with some details.
- Asia Pacific ROC
- A Beijing site asked Min and the Taipei ROC for help. It is left to Min to decide whether they want to support the site, but CERN are of course happy to take it on.
WLCG Items
LHCb and ATLAS both present.
LHCb
- GGUS:30562
- CNAF is not usable for reprocessing activity because files cannot be opened through the rfio protocol (the connection gets stuck after the file has been opened). CNAF are waiting for instructions from CASTOR support on this issue. One solution is to install rootd and access files through it (this cured a similar problem experienced at CERN in the past). I'd set a very urgent action on CASTOR support to give recommendations to CNAF so that the site can get back to work.
- LHCb wants to emphasize that NIKHEF/SARA has not been used (it has been felt to be unusable) because of continuous downtimes and storage problems in recent weeks. Can we have a report from SARA?
Upcoming WLCG Service Interventions
* TRIUMF has a power cut on the 27th and 28th.
* SARA-MATRIX is in downtime; problems with PNFS.
* PIC has a downtime of a couple of days, 19-20.
* CERN is closed for two weeks over Christmas; there is cover for all vital services.
FTS Service Review
ATLAS Service
ALICE Service
CMS Service
LHCb Service
Add in the Minutes Items
- grid-proxy-init comments
- grid-proxy-init changed with the upgrade to gLite UI 3.1 and does not work with the gLite 3.0 RB. Request that things be tested, ideally before they go into production.
WLCG Service Coordination
Review of Action Items
AOB
Next Meeting
The next meeting will be Monday, 14 Jan 2008 15:00 UTC (16:00 Swiss local time).
- Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
- The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
- The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
- To dial in to the conference:
- Dial +41227676000
- Enter access code 0157610