WLCG-OSG-EGEE Ops' Minutes Thu 11 Jun 2009

Summary

The main outage of the RAL Tier1 (RAL-LCG2) will take place over a period of a couple of weeks at the end of June and early July. A blog at: http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/ details the scheduled outages for component services.

Attendance

EGEE

  • Asia Pacific ROC: Jason Shih
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: Antonio Retico, Diana Bosio, Nick Thackray
  • French ROC: Pierre Girard
  • German/Swiss ROC:
  • Italian ROC:
  • Northern Europe ROC: Ron Trompert
  • Russian ROC: Victor Edneral, Alexander Kryukov
  • South East Europe ROC:
  • South West Europe ROC: Christian Neissner
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Torsten Antoni, Guenter Grein
  • GOCDB:

WLCG

  • WLCG Service Coordination: Harry Renshall

WLCG Tier 1 Sites

  • ASGC: Jason Shih
  • BNL: Absent
  • CERN site: Absent
  • FNAL: Catalin Dumitriescu
  • FZK: Absent
  • IN2P3: Pierre Girard
  • INFN: Absent
  • NDGF: Vera Hasper
  • PIC: Absent
  • RAL: Gareth Smith
  • SARA/NIKHEF: Ron Trompert
  • TRIUMF: Absent

OSG

Kyle Anthony

LHC Experiments

  • ATLAS: absent
  • LHCb: absent
  • CMS: absent
  • ALICE: absent

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

  "Old style" COD Team
From Germany/Switzerland (DECH)
To Russia

  • Report from "old style" COD:

Two GGUS tickets to report: https://gus.fzk.de/ws/ticket_info.php?ticket=47749, https://gus.fzk.de/ws/ticket_info.php?ticket=48007. Both tickets are for the SEE ROC, concern the same site, and have been open for a long time.

UPDATE at the meeting: the tickets are closed in GGUS, so there might be a problem in the interface with the CIC portal, as the COD has kept updating a closed ticket since April 30th.

UPDATE after the meeting: Site in SD until 20/06.

  c-COD Team
From North Europe (NE)
To Asia Pacific (AP)

  • Report from cCOD:
Quiet week : nothing to report

Sites Considered For Suspension
None.

PPS Reports and Issues

  • gLite 3.2 UPDATE 03: the new versions of the UI and WN on SL5.
https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_320_PPS_Update03

  • gLite 3.1 UPDATE 48: new version of GFAL and lcg-utils.
https://twiki.cern.ch/twiki/bin/view/EGEE/PPSReleaseNotes_310_PPS_Update48

gLite Release News

  • gLite 3.1 UPDATE 47:
    • new version of GFAL and lcg-utils,
    • new AMGA
    • myproxy client on the WN

will be released this week.

The updates to fetch-crl scripts, and WN grid-cm-* packages were postponed.

EGEE Items From ROC Reports

  • France : Since 01/06/2009, one of the regional Top BDIIs, hosted at GRIF, has had problems, initially caused by a fault in the air cooling system. The GRIF WMS consequently had problems as well, because it was linked to this Top BDII.

UPDATE: GRIF BDII has been moved to another site.

  • France : IN2P3-CC, the MSS software update ended successfully on Friday. The dCache SE is now fully available.

  • DECH : We needed to ban some users for various reasons: some jobs were completely filling /tmp (VOs icecube and biomed), and hundreds of running jobs were being killed by the CPU time limit (ATLAS). The first two cases were quickly fixed via GGUS. The ATLAS case has been open for almost two weeks: https://gus.fzk.de/ws/ticket_info.php?ticket=49052 (assigned to VOsupport). How should sites react when users have been banned? The LHC experiments can send alarm tickets to sites; how should sites approach the VOs?

The equivalent of ALARM tickets to VOs is a request from WLCG that is under discussion at GGUS. A restricted number of site contacts (one or two per site) will be able to contact VOs in an emergency. Otherwise, in normal situations, the VO support unit is the way to contact the VO. If the ticket is not dealt with appropriately, just escalate it.

  • SWE: During the migration of 32-bit workers to 64-bit, PIC faced too many problems related to the dependencies of LHC software on 32/64-bit libraries.
We are not happy with the situation of having production releases that are poorly tested against the experiments' software (at least LHC). For reference, see e.g. the thread in LCG-ROLLOUT: "libstdc++-devel.i386 and libstdc++-devel.x86_64".

Reply from Integration and Certification: we are working with the Applications Area to produce a meta-rpm that pulls in the OS libraries needed by the HEP VOs.

UPDATE: WLCG has a list of recommended packages/libraries.

Antonio: there is no immediate way to fix this at distribution level, but we can work towards providing a robust way of performing a staged roll-out.

Christian: Are there Blueprints for SLAs for sites that want to join the first phase of the staged roll-out?

Antonio: not yet. The discussion will start tomorrow at the SA1 coordination meeting.

Grid Service Interventions

ALL TIMES IN UTC+2

Downtimes affecting the WLCG Tier-1 sites:

  • NDGF-T1: At risk: 08:00 9 Jun - 00:00 11 Jun. Services: Bergen will update the fimm cluster and the Tier1 machines (compute nodes, dcache machines, grid middleware servers) to Rocks 5.1 with CentOS 5.3 at UiB. Will degrade services a bit.
  • RAL-LCG2: OUTAGE: 10:00 8 Jun - 10:00 15 Jun. Services: Relocation to new machine room [IN PROGRESS].
  • NDGF-T1: OUTAGE: 00:15 8 Jun - 04:15 8 Jun. Services: GEANT's circuit provider will be performing maintenance on the dark fibre route COP-FRA.
  • NDGF-T1: At Risk: 07:30 5 Jun - 15:00 8 Jun. Services: Some dCache pools crashed this morning. Some ATLAS and ALICE files will be unavailable until the pools have been brought online again. Most pools are back, but two are still giving us problems. Investigation in progress. [IN PROGRESS]
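
For readers who track interventions in UTC, the windows listed above can be converted with a short sketch. It assumes only what the agenda states (all times are UTC+2); the function name `agenda_to_utc` and the `year` default are hypothetical, since the agenda entries omit the year, and month names are parsed in an English locale.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset taken from the agenda note "ALL TIMES IN UTC+2".
AGENDA_TZ = timezone(timedelta(hours=2))


def agenda_to_utc(text, year=2009):
    """Parse an agenda time such as '10:00 8 Jun' and return it in UTC.

    The year is an assumption supplied by the caller; the agenda omits it.
    """
    local = datetime.strptime(f"{text} {year}", "%H:%M %d %b %Y")
    return local.replace(tzinfo=AGENDA_TZ).astimezone(timezone.utc)


# The RAL-LCG2 window from the list above.
start = agenda_to_utc("10:00 8 Jun")
end = agenda_to_utc("10:00 15 Jun")
print(start.isoformat())  # 2009-06-08T08:00:00+00:00
print(end - start)        # 7 days, 0:00:00
```

Note that an entry shortly after midnight, such as 00:15 on 8 Jun, falls on the previous calendar day in UTC.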

UPDATE ON THE RAL OUTAGE: The RAL Tier1 (RAL-LCG2) will be moving its hardware to a new building. There is an entry in the agenda for today's (8th June) meeting referring to an outage of the RAL Tier1 from 8-15 June for this move. This entry in the GOC DB is for a specific part of the service (the CE for the UK NGS service).

The main outage of the RAL Tier1 (RAL-LCG2) will take place over a period of a couple of weeks at the end of June and early July. A blog at: http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/ details the scheduled outages for component services.

OSG Items

* Discussion of open tickets for OSG

It is now urgent to get an OSG answer on the site email as per https://savannah.cern.ch/support/?107531

Ticket analysis done today by Guenter Grein:

  1. GGUS Ticket #49049 (OSG #6926) Ticket is in progress in GGUS but closed in OSG
    Reason
    GGUS received the "Closing" mail before the update mails, which then made the mail parser set the GGUS ticket to "in progress".
    Conclusion
    The mail parser works correctly, but problems occur when mails are delayed, especially if more than one update mail is sent in a short time slot -> I closed this ticket manually.

  2. GGUS Ticket #48962 (OSG #6924) Both tickets open -> ok

  3. GGUS Ticket #48737 (OSG #6922) Both tickets open -> ok

  4. GGUS Ticket #37059 (OSG #6926) Both tickets open -> ok
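
The ordering problem seen with ticket #49049 can be illustrated with a minimal sketch. This is not the GGUS mail parser: the `Update` type, its `sent_at` field, the status names and the timestamps are all hypothetical, and the sketch assumes each mail carries a trustworthy send time. It contrasts applying mails in arrival order with ignoring any mail older than the newest update already applied.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Update:
    """One status-changing mail (hypothetical structure, not the GGUS format)."""
    sent_at: datetime  # when the remote system sent the mail
    status: str        # status the mail asks for, e.g. "in progress"


def status_by_arrival(updates):
    """Naive approach: the last mail to arrive wins."""
    status = "open"
    for update in updates:
        status = update.status
    return status


def status_by_send_time(updates):
    """Skip any mail that was sent before the newest update already applied."""
    status = "open"
    newest = None
    for update in updates:
        if newest is None or update.sent_at > newest:
            status = update.status
            newest = update.sent_at
    return status


# Illustrative arrival order: the "closed" mail overtakes an earlier update.
arrived = [
    Update(datetime(2009, 6, 8, 10, 5), "closed"),
    Update(datetime(2009, 6, 8, 10, 1), "in progress"),
]

print(status_by_arrival(arrived))    # in progress
print(status_by_send_time(arrived))  # closed
```

With a check of this kind a delayed update mail would not reopen a ticket that has already been closed, although it relies on the sending system's clock being reliable.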

Newly Created Action Items

None.

Review of Open Action Items

Open Action Items

None.

Actions Closed in Last 20 Days

None.

AOB

Next Meeting

The next meeting will be Monday, 15 Jun 2009 14:00 UTC (16:00 Swiss local time).

  • Attendees can join from 13:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 14:00 UTC (16:00 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0148141


Topic revision: r2 - 2009-06-16 - DianaBosio