WLCG-OSG-EGEE Ops Minutes, Mon 14 Jan 2008
Attendance
EGEE
- Asia Pacific ROC: Min Tsai
- Central Europe ROC: Marcin Radecki
- OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen
- French ROC: Gilles + lots of people ???
- German/Swiss ROC: Sven Hermann
- Italian ROC: Alessandro Cavalli
- Northern Europe ROC: Ron
- Russian ROC: Lev Shamardin
- South East Europe ROC: Kostas Koumantaros
- South West Europe ROC: Kai Neuffer
- UK/Ireland ROC: Matt Hodges, Derek Ross, Jeremy Coles
- GGUS: Maria Dimou (for User Support)
- OSCT: Absent
WLCG
- WLCG Service Coordination: Harry, Jamie
WLCG Tier 1 Sites
- ASGC: Min Tsai
- BNL: Absent
- CERN site: Harry Renshall
- FNAL: Rob Quick
- FZK: Sven Hermann
- IN2P3: ???
- INFN: Alessandro
- NDGF:
- PIC: Gonzalo
- RAL: Absent
- SARA/NIKHEF: Ron
- TRIUMF: Rod Walker
Reports Not Received
- WLCG Tier 1s:
- VOs:
- EGEE ROCs (Prod Sites): CERN
- EGEE ROCs (PPS Sites): AP, CERN, IT
Happy New Year
The meeting got off to a late start due to static caused by the Fermilab connection.
Steve (chair) wished everyone a Happy New Year.
Feedback on Last Week's Minutes
None were given.
EGEE Items
Grid Operator Hand Over on Duty
| | Primary Team | Secondary Team |
| From | Italy | SouthWest |
| To | Central Europe | France |
PPS Reports
Business as usual
EGEE Items From ROC Reports
- (ROC CE): It looks like we have a central problem with accounting data. The list of sites not publishing accounting data contains about 40 sites which suddenly stopped publishing in Dec 2007: http://www3.egee.cesga.es/acctenfor/nodata.php
Some sites in CE reported problems with APEL similar to this bug: https://savannah.cern.ch/bugs/?32435
Could the APEL people comment on that?
- Derek (UKI) mentioned that the APEL box's certificate had expired; this should have been fixed sometime last week. CE will check that it's OK.
- (ROC CE): When can we expect the MON box on SL(C)4? For sites using SL4 this is one of the SL3 dependencies.
- Marcin (CE): The SL4 MON box is needed in order to publish accounting records.
- Steve will ask Oliver to come next week for an update. He will look for more info. Other gLite services based on Tomcat are deployed on SLC4, so it shouldn't take too long to get things into production.
Upcoming gLite releases
- gL3.1 U10 --> Prod (~Thursday) will contain, in particular:
  - glite-PX for gLite 3.1
  - gLite-AMGA_postgres for gLite 3.1
  - VOBOX
  - edg-mkgridmap-3.0.0 compatible with OpenSSL 0.9.7
- gL3.0 U38 --> Prod (~Thursday) will contain:
  - ~20 patches with bug fixes
WLCG Items
Tier1 Reports
- none received (business as usual)
WLCG issues coming from ROC reports
Upcoming WLCG Service Interventions
- Sven (DECH) asked to remove the intervention at GridKa (6 Nov) from the template
- Ron (NE): dCache off-line at SARA for a re-configuration to implement the space management requirements from CCRC08
FTS Service Review
none
ATLAS Service (Alessandro Di Girolamo)
- Storage Space:
Each site should publish in the Information System updated information in the following fields:
o GlueSAStateAvailableSpace
o GlueSATotalOnlineSize
o GlueSAUsedOnlineSize
for:
o each storage area with space tokens associated
o each storage area associated with "default spaces" for a given storage class
This information is crucial for CCRC08
Thanks in advance
- Steve (OCC): Agrees that this is the correct solution but the publication of StorageSpace is not likely to be released
- Alessandro (ATLAS): The issue concerns all VOs (LHCb agrees). We need correct storage space to be published in the information system. Currently ATLAS uses dpm-query or the VOBOX to get the information. The information does not always match what is returned by lcg-infosites (RAL, with CASTOR, is an example of such a mismatch). Sites running out of space need to be black-listed immediately, so it is important to be able to rely on the info provider. Currently ATLAS has to retrieve the information site by site, without any standard. At the same time some sites are publishing correctly, so at least the T1s should make their info converge.
- Nick suggests starting with the T1s:
- Agree on a deterministic recipe for the T1s to follow, with Flavia (GSSD)
- Open a set of tickets to the ROCs to follow up the change
- Alessandro: the deadline for ATLAS is the end of January
- Antonio (CERN ROC): with a good set of instructions (and a metric to verify) the ROCs can do it
- SE/SRM SAM critical tests for BNL Tier1 failing since mid December
- Steve: GGUS 31218. To me at least, the ticket is wrong for the problem. The problem is "No space for atlas"
- Rob will bring up the issue at the OSG ops meeting later in the afternoon
- ATLAS would like to know the status and the time schedule for srmls on lxplus: right now it is deployed only for the CERN PPS.
- Steve: The command is apparently available
/afs/cern.ch/project/gd/LCG-share/current/d-cache/srm/bin/srmls
- Alessandro will verify if that is the one wanted
- Problems retrieving group attributes from DPM (point added during the meeting)
dpm-query is now more user friendly than before, as it returns more readable info than the bare UID or GID, but some sites in the French ROC are still publishing in a way that makes dpm-query return the old format
- This is a good candidate for a GGUS ticket
- Alessandro opened it on-line
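As an illustration of the Storage Space request above, the following is a minimal sketch of a check that a storage-area record publishes the three requested fields. It is not an existing gLite or ATLAS tool: the helper names (parse_record, check_record) and the sample record are hypothetical, and it assumes the record is available as plain "attribute: value" text. It only verifies that each field is present and is a non-negative integer; it makes no assumption about units or about how the values relate to each other.

```python
# Hypothetical helper, not part of any gLite tool: checks that a storage-area
# record carries the three Glue attributes requested in the minutes.
REQUIRED_FIELDS = (
    "GlueSAStateAvailableSpace",
    "GlueSATotalOnlineSize",
    "GlueSAUsedOnlineSize",
)


def parse_record(ldif_text):
    """Parse 'attribute: value' lines into a dict (first value wins)."""
    record = {}
    for line in ldif_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not 'name: value'.
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        record.setdefault(name.strip(), value.strip())
    return record


def check_record(record):
    """Return a list of problems; an empty list means all fields look usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None:
            problems.append(field + ": not published")
        elif not value.isdigit():
            problems.append(field + ": not a non-negative integer (" + value + ")")
    return problems


# Made-up sample record, for illustration only.
SAMPLE = """
GlueSALocalID: atlas
GlueSAStateAvailableSpace: 1200
GlueSAUsedOnlineSize: unknown
"""

if __name__ == "__main__":
    for problem in check_record(parse_record(SAMPLE)):
        print(problem)
```

A check of this kind could serve as the "metric to verify" mentioned in the discussion, run per storage area once the recipe for the T1s has been agreed.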
ALICE Service
nothing reported
CMS Service
nothing reported
LHCb Service (Roberto Santinelli)
- Roberto (LHCb): rfio problems at CNAF (and now also at RAL).
The problem (the connection hangs when a file on the SE is read from the WN
using the rfio protocol) is under investigation by the CASTOR people with the support of the CNAF
people. However, since CNAF has been out of the production mask for months now (with its
accounting suffering), we are looking for the shortest way to get it fixed: accessing files
through rootd rather than through rfiod. This has been proven to work at CERN
(where it is happily used).
With this report I'd like to recall this issue (which heavily penalizes the computing mask of LHCb)
and to set some actions that should be addressed consistently:
1. CASTOR people + CNAF people to debug the rfio problem.
2. CNAF people to install, configure and test rootd. They have the support of the FIO
and CASTOR people at CERN, and this should be foreseen for this week.
3. If the recipe works at CNAF, involve the RAL people for point 2.
- Derek (RAL): We were not aware of castor problems at RAL
- Roberto: the issue happened very recently at RAL as well. The suggestion is to wait for CNAF to get back in production and then apply the same fix at RAL
- Roberto: the aim of this report is to get a commitment from the OPS team to follow up this particular problem, namely for one site (CNAF) to sum up the state-of-the-art solution and then for the other WLCG sites to fix it where needed
- CNAF (with Luca dell'Agnello) is aware of this request and is actively working on it. BTW, LHCb has not been able to run at CNAF since September.
- Steve asked for a representative from CNAF to join the meeting next week. A CASTOR expert from CERN will join as well
- Alessandro Cavalli (CNAF) will transmit the request
WLCG Service Coordination
Maria (User Support) went through 4 old OSG tickets still open. Rob will check them.
Review of Action Items
AOB
- Maria (User Support): The VO YAIM configurator tool is now linked to the CIC portal. Does anyone use it?
- ???: It needs to mature a bit first, and it needs a documentation page for sites to consult
- Kai (PIC): We have used it successfully. The output files were used for production after some manual modifications concerning the local storage configuration
- Nick: ROCs will be asked to volunteer to try out the tool
Next Meeting
The next meeting will be Monday, 21 Jan 2008 15:00 UTC (16:00 Swiss local time).
- Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
- The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
- The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
- To dial in to the conference:
- Dial +41227676000
- Enter access code 0157610