WLCG-OSG-EGEE Ops' Minutes Thu 11 Jun 2009
Summary
The main outage of the RAL Tier1 (RAL-LCG2) will take place over a period of a couple of weeks at the end of June and early July. A blog
at:
http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/
details the scheduled outages for component services.
Attendance
EGEE
- Asia Pacific ROC: Jason Shih
- Central Europe ROC: Malgorzata Krakowian
- OCC / CERN ROC: Antonio Retico, Diana Bosio, Nick Thackray
- French ROC: Pierre Girard
- German/Swiss ROC:
- Italian ROC:
- Northern Europe ROC: Ron Trompert,
- Russian ROC: Victor Edneral, Alexander Kryukov
- South East Europe ROC:
- South West Europe ROC: Christian Neissner
- UK/Ireland ROC: Jeremy Coles
- GGUS: Torsten Antoni, Guenter Grein
- GOCDB:
WLCG
- WLCG Service Coordination: Harry Renshall
WLCG Tier 1 Sites
- ASGC: Jason Shih
- BNL: Absent
- CERN site: absent
- FNAL: Catalin Dumitriescu
- FZK: Absent
- IN2P3: Pierre Girard
- INFN: Absent
- NDGF: Vera Hasper
- PIC: Absent
- RAL: Gareth Smith
- SARA/NIKHEF: Ron Trompert
- TRIUMF: Absent
Kyle Anthony
LHC Experiments
- ATLAS: absent
- LHCb: absent
- CMS: absent
- ALICE: absent
Feedback on Last Week's Minutes
None was given.
EGEE Items
Grid Operator Hand Over on Duty
| "Old style" COD Team | From | Germany/Switzerland (DECH) | To | Russia |
- Report from "old style" COD:
Two GGUS tickets to report:
https://gus.fzk.de/ws/ticket_info.php?ticket=47749
and
https://gus.fzk.de/ws/ticket_info.php?ticket=48007
. Both tickets are for the SEE ROC, concern the same site, and have been open for a long time.
UPDATE at the meeting: the tickets are closed in GGUS, so there might be a problem in the interface with the CIC portal, as the COD has kept updating a closed ticket since April 30th.
UPDATE after the meeting: Site in SD until 20/06.
| c-COD Team | From | North Europe (NE) | To | Asia Pacific (AP) |
Quiet week : nothing to report
Sites Considered For Suspension
None.
PPS Reports and Issues
- gLite 3.2 UPDATE 03: the new versions of the UI and WN on SL5.
https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_320_PPS_Update03
- gLite 3.1 UPDATE 48: new version of GFAL and lcg-utils.
https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update48
gLite Release News
- gLite 3.1 UPDATE 47:
- new version of gFAL and lcg-utils,
- new AMGA
- myproxy client on the WN
will be released this week.
The updates to fetch-crl scripts, and WN grid-cm-* packages were postponed.
EGEE Items From ROC Reports
- France : Since 01/06/2009, one of the regional Top BDIIs, hosted at GRIF, has had problems, initially caused by an air cooling system failure. The GRIF WMS consequently had problems because it was linked to this Top BDII.
UPDATE: the GRIF BDII has been moved to another site.
- France : IN2P3-CC: the MSS software update completed successfully on Friday. The dCache SE is now fully available.
- DECH : We needed to ban some users for various reasons: some jobs were completely filling /tmp (VOs icecube and biomed), and hundreds of running jobs were being killed by the CPU time limit (ATLAS). The first two cases were quickly fixed via GGUS. The ATLAS case has been open for almost two weeks: https://gus.fzk.de/ws/ticket_info.php?ticket=49052
(Assigned to VOsupport) How should sites react in cases where users have been banned? The LHC VOs can send alarm tickets to sites; how should sites approach the VOs?
The equivalent of ALARM tickets to VOs is a request from WLCG that is under discussion at GGUS. A restricted number of site contacts (one or two per site) will be able to contact VOs in an emergency. Otherwise, in normal situations, the VO support unit is the way to contact the VO. If the ticket is not dealt with appropriately, just escalate the ticket.
- SWE: During the migration of 32-bit workers to 64-bit, PIC faced too many problems related to the dependencies of LHC software on 32/64-bit libraries.
We are not happy with the situation of having production releases that are poorly tested against the experiments' software (at least LHC); for reference, see e.g.
- thread in LCG-ROLLOUT: "libstdc++-devel.i386 and libstdc++-devel.x86_64"
o Reply from Integration and Certification: we are working with the Applications Area to produce a meta-rpm that pulls in the OS libraries needed by the HEP VOs.
UPDATE: WLCG has a list of recommended packages/libraries.
Antonio: there is no immediate way to fix this at distribution level, but we can work towards providing a robust way of performing a staged roll-out.
Christian: Are there Blueprints for SLAs for sites that want to join the first phase of the staged roll-out?
Antonio: not yet. The discussion will start tomorrow at the SA1 coordination meeting.
Grid Service Interventions
ALL TIMES IN UTC+2
Downtimes affecting the WLCG Tier-1 sites:
- NDGF-T1: At risk: 08:00 9 Jun - 00:00 11 Jun. Services: Bergen will update the fimm cluster and the Tier1 machines (compute nodes, dcache machines, grid middleware servers) to Rocks 5.1 with CentOS 5.3 at UiB. Will degrade services a bit.
- RAL-LCG2: OUTAGE: 10:00 8 Jun - 10:00 15 Jun. Services: Relocation to new machine room [IN PROGRESS].
- NDGF-T1: OUTAGE: 00:15 8 Jun - 04:15 8 Jun. Services: GEANT's circuit provider will be performing maintenance on the dark fibre route COP-FRA.
- NDGF-T1: At Risk: 7:30 5 Jun - 15:00 8 Jun. Services: Some dCache pools crashed this morning. Some ATLAS and ALICE files will be unavailable until the pools have been brought online again. Most pools are back, but two are still giving us problems. Investigation in progress. [IN PROGRESS]
UPDATE ON THE RAL OUTAGE: The RAL Tier1 (RAL-LCG2) will be moving its hardware to a new building. There is an entry in the agenda for today's (8th June) meeting referring
to an outage of the RAL Tier1 from 8-15 June for this move. This entry in the GOC DB is for a specific part of the service (the CE for the UK
NGS service).
The main outage of the RAL Tier1 (RAL-LCG2) will take place over a period of a couple of weeks at the end of June and early July. A blog
at:
http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/
details the scheduled outages for component services.
* Discussion of open tickets for OSG
It is now urgent to get an OSG answer on the site email as per
https://savannah.cern.ch/support/?107531
Ticket analysis done today by Guenter Grein:
- GGUS Ticket #49049 (OSG #6926) Ticket is in progress in GGUS but closed in OSG
- Reason
- GGUS received the "Closing" mail before the update mails, which then made the mail parser set the GGUS ticket to "in progress".
- Conclusion
- The mail parser works correctly, but problems occur when mails are delayed, especially if more than one update mail is sent in a short time slot -> I closed this ticket manually.
- GGUS Ticket #48962 (OSG #6924) Both tickets open -> ok
- GGUS Ticket #48737 (OSG #6922) Both tickets open -> ok
- GGUS Ticket #37059 (OSG #6926) Both tickets open -> ok
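The out-of-order mail problem noted for ticket #49049 can be illustrated with a minimal sketch. It is not the GGUS mail parser: the `Update` and `Ticket` classes, the `sent_at` field and the timestamps are all hypothetical, and it assumes each mail carries a reliable send time. It only shows one common way to keep a late-arriving update from overwriting a newer "closed" status.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Update:
    sent_at: datetime  # hypothetical: time the remote system sent the mail
    status: str        # status requested by the mail


@dataclass
class Ticket:
    status: str = "open"
    last_applied: datetime = datetime.min  # send time of the newest applied mail


def apply_update(ticket: Ticket, update: Update) -> bool:
    """Apply an update only if it was sent after the last one applied."""
    if update.sent_at <= ticket.last_applied:
        return False  # stale mail that arrived late: ignore it
    ticket.status = update.status
    ticket.last_applied = update.sent_at
    return True


# Illustrative timestamps: the "closed" mail was sent last but arrives first.
ticket = Ticket()
closing = Update(datetime(2009, 6, 8, 12, 5), "closed")
progress = Update(datetime(2009, 6, 8, 12, 0), "in progress")
for mail in (closing, progress):
    apply_update(ticket, mail)
print(ticket.status)
```

With this ordering check the delayed "in progress" mail is dropped and the ticket stays closed; without it, the ticket would end up "in progress" as described above.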
Newly Created Action Items
None.
Review of Open Action Items
Open Action Items
None.
Actions Closed in Last 20 Days
None.
AOB
Next Meeting
The next meeting will be Monday, 15 Jun 2009 14:00 UTC (16:00 Swiss local time).
- Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
- The meeting will start promptly at 14:00 UTC (16:00 Swiss local time).
- To dial in to the conference:
- Dial +41227676000
- Enter access code 0148141