Please, when logging in to the audio-conference web interface, specify your name and affiliation. This makes constructing the list of attendees a lot easier. What are the affiliations of Catalin? Emanouil Atanassov?
Reports Not Received
VOs: Alice, BioMed, LHCb
EGEE ROCs (Prod Sites): All reports received!
Meeting started at 16:04 (4 minutes late).
Jeremy Coles informed the meeting that no PPS report was received from UKI because there was no edit option for the report. A GGUS ticket has been raised for the CIC portal developers to investigate.
Feedback on Last Week's Minutes
No feedback.
EGEE Items
Grid Operator Hand Over on Duty
From: ROC Russia (primary team), ROC DECH (secondary team)
To: ROC AP (primary team), ROC SEE (secondary team)
SRM problems at YerPhi - site responds, so can't be suspended. Better solution needed.
SLA being put into place at the moment. TAU-LCG2 (SE ROC) is doing the strict minimum to avoid suspension. Nick suggests suspending the site (OCC decision) because they use too much support time. Kostas will give them a last chance & will suspend if necessary. Action: decision within two weeks.
Point 2. Lev explained that the site ru-Chernogolovka-IPCP-LCG2 was certified in error for half a day (hence the alarms).
Point 3 (rgma node in GOCDB). Antonio explained that IC (Information Catalogue) needs to be used. GGUS:33927
Following recent SAM outage, SE ROC suggests monitoring for monitoring. John replied that this is a priority for the SAM team.
Long thread in LCG rollout about SAM csh test failing on Hepix nodes. Maite asked if someone had the sense to submit GGUS tickets. Update: This is due to a Hepix bug, but a new SAM sensor with a workaround will shortly be released.
PPS Reports
No major issues. See meeting agenda for detailed release news.
EGEE Items From ROC Reports
Handling outages of SAM & other services not under a site's control - from Marcin, CE ROC. John explained that future SAM architecture will rely on buffering, and that SAM service reliability is top of the SAM team's agenda. For their availability reports, WLCG have decided to treat "no-data" periods as site being available (after the 24hrs timeout of the last valid SAM status). We could make the same assumption. Added complication is dependent services outside a site. John to continue the discussion with WLCG, GridView, Marcin et al.
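The "no-data counts as available" rule described above can be illustrated with a minimal sketch. It is only one reading of the rule, not the WLCG or GridView implementation: it assumes hourly sampling, that the last valid SAM status is held for 24 hours, and that anything after that timeout (or before any result exists) is a "no-data" period. The function name `hourly_availability` and the `(timestamp, ok)` input format are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the last valid SAM status is held for 24 hours (the timeout in the minutes).
STATUS_TIMEOUT = timedelta(hours=24)


def hourly_availability(results, start, end):
    """Return the fraction of hours in [start, end) counted as available.

    results: list of (timestamp, ok) tuples sorted by timestamp (hypothetical format).
    Hours with no valid status (none yet, or the last one timed out) count as available.
    """
    available = 0
    total = 0
    idx = 0
    last = None  # most recent (timestamp, ok) at or before the current hour
    hour = start
    while hour < end:
        # Advance to the latest result that is not in the future of this hour.
        while idx < len(results) and results[idx][0] <= hour:
            last = results[idx]
            idx += 1
        if last is None or hour - last[0] >= STATUS_TIMEOUT:
            ok = True  # "no-data" period: assumed available
        else:
            ok = last[1]
        if ok:
            available += 1
        total += 1
        hour += timedelta(hours=1)
    return available / total if total else 1.0
```

Under these assumptions, a single failed test followed by a 48-hour SAM outage would count the site as unavailable for the first 24 hours only. Dependent services outside the site, the added complication mentioned above, are not modelled here.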
Weekend SAM outage was due to a combination of an RPM bug and human oversight. SAM team will implement local fabric monitoring to detect outages automatically in the future.
Lev (Russia): anonymous r/w access (with no grid certificate) is granted on DPM nodes with standard setup. This violates traceability & logging policy. DPM & dCache developers have been mailed and have promised authentication "really soon", but "really soon" means two months in dCache land. Nick suggested passing the buck to the OSCT, since there is a vulnerability in something that is in production.
gLite Release News
Antonio: not much to add to what was on the agenda page.
Catalin from Fermilab(?): Before releasing certificates can you check that they're valid? After update, we don't want RPMs with invalid certificates!
Antonio: This is a bug, please open a ticket. NB: distribution of certificates is not covered by the release team.
WLCG Items
WLCG issues coming from ROC reports
None
Upcoming WLCG Service Interventions
In addition to those listed on the agenda page, Min gave notice of AGC downtime Wednesday 18/3/08 23:30 - 19/3/08 10:00 for Castor2 and Oracle RAC h/w migration.
7-11 April: CERN to T1 backup links will be tested, there may be some instability.
One PIC tape robot will be unavailable for about 1 month. Gonzalo: upgrading to new robot (company hasn't provided dates), but service will not stop. Anything on old robot will need to be moved to new one. No problem for new data being written.
WLCG Service Coordination and CCRC08 review
Harry: nothing in e-log since Thursday, which is good.
ATLAS Service
Alessandro:
Continuing with Tier-1 to Tier-1 functional tests, and pushing sites to upgrade to SL4, since time is running out.
Downtimes (also discussed with other experiments). Asking for a web interface or CLI to retrieve foreseen downtimes.
Antonio said that the information was in SAM, but it was pointed out that SAM doesn't look into the future.
Steve requested the list of questions Alessandro wants answered, since "the information is there" & should be easy to implement. Alessandro will formalize what he & other experiments want.
ALICE Service
CMS Service
Daniele Bonacorsi gave a very detailed report, the contents of which can be found on the meeting's agenda page. There's been a lot of CMS activity, and only a few problems:
an increasing number of queued requests being investigated by CASTOR team
missing ProdAgent functionality being worked on by DM/WM developers
Daniele also announced that there would be a regular review of CMS-specific SAM tests.
LHCb Service
Roberto Santinelli mentioned that there hadn't been much activity due to software week. Some work was done with INFN to test TSM s/w, and 8TB were successfully transferred to CNAF prior to their scheduled downtime (next week and the one after).