WLCG-OSG-EGEE Ops' Minutes Mon 05 May 2008

Attendance

EGEE

  • Asia Pacific ROC: Min Tsai
  • Central Europe ROC: Malgorzata Krakowian
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen, Diana Bosio, Maria Dimou
  • French ROC: David Bouvet, Cyril l’Orphelin, Pierre, Osman, Helène Cordier
  • German/Swiss ROC: Clemens Koerdt, Sven Hermann
  • Italian ROC: Absent
  • Northern Europe ROC: Absent
  • Russian ROC: Absent
  • South East Europe ROC: Absent
  • South West Europe ROC: Kai Neuffer, Gonzalo Merino
  • UK/Ireland ROC: Absent
  • GGUS: Guenter Grein
  • OSCT: Absent

WLCG

  • WLCG Service Coordination: Harry Renshall, Jamie Shiers

LHC VOs

  • ALICE: Simone Campana
  • ATLAS: Alessandro di Girolamo
  • CMS: Daniele Bonacorsi
  • LHCb: Roberto Santinelli

WLCG Tier 1 Sites

  • ASGC: Min Tsai
  • BNL: Absent
  • CERN site: Absent
  • FNAL: Absent
  • FZK: Sven Hermann, Clemens Koerdt
  • IN2P3: Pierre, Helène
  • INFN: Absent
  • NDGF: Anders Rhod Gregersen
  • PIC: Gonzalo Merino, Kai Neuffer
  • RAL: Absent
  • SARA/NIKHEF: Absent
  • TRIUMF: Absent

Today is an Orthodox (and UK?) holiday. The audio-conference GUI listed three Sunrise attendees, but the minute-taker has no idea who they were. Those hiding behind 0478930880 were, I think, David Bouvet, Cyril l’Orphelin, Pierre, Osman, Helène Cordier.

Reports Not Received

  • VOs: ALICE, ATLAS, LHCb, BioMed
  • EGEE ROCs (Prod Sites): Italy, Russia, SE Europe, UK/I

Nick: Did ROCs have problems submitting CIC portal reports? Deafening silence was taken to be a “No”.

Feedback on Last Week's Minutes

None was given.

EGEE Items

Grid Operator Hand Over on Duty

          Primary Team    Secondary Team
  From    ROC France      NDGF
  To      ROC CERN        ROC SE

Issues from ROC France:

  • 3-day retention period for deleted nodes in SAM. Answer: on the SAM work plan, no firm date yet, but near the top of the list [*update:* will be implemented on 13/5/08]
  • YerPhi – do we keep them? No exclusion policy, but site asked to be suspended twice. A GGUS ticket is open & will be worked on tomorrow by Antonio (CERN is on-duty this week). [*update:* YerPhi has agreed to be, and has been, suspended]

Issues from NDGF COD:

  • Many alarms due to sites not updating their installed certificates. Yep!
  • NDGF on-duty reported that the CIC portal alarm field needs to be toggled a few times before it actually sticks. This lag is normal, and due to propagation delays. Cyril explained that a refresh occurs every 2 minutes, so it’s a question of being patient.

Nick expressed his thanks to NDGF for having joined the COD on-duty rota.

PPS Reports (Antonio)

In Lyon (France ROC), the new 64-bit WN tarball distribution was used. It works, except for an undefined LDAP dependency – YAM should be used instead. This is the first instance of 64-bit WNs in production, but they’re still only accessible through a CE in PPS (VOs are invited to try it out).

Gonzalo from PIC: Is it only available as a tarball? Are any RPMs available? Antonio: They will be in the next release to production. Pierre: What should be published as the CE architecture? WNs must be set up with compatibility libraries. Antonio: The compatibility libraries are present. Pierre: Can it be put in production without causing problems for VOs? Antonio: Wait for the production release currently being tested, or publish the CE in your BDII with a special flag. Pierre: This won’t work, because the CE will attract jobs.

Roberto said that LHC VOs check production status in JDL, so jobs won’t go to CE if its status is not production. Pierre said this wasn’t true, as CMS had been guilty of flooding a CE in February which was publishing a state of “test”. The conclusion was that the CE will only be advertised in the PPS BDII to avoid complications.

gLite Release News

No new gLite releases, either in validation or production, but a new release is being prepared. A summary was given by Antonio; please refer to agenda page for details.

Gonzalo: dCache (for server or clients)? Antonio: for server. Gonzalo: what about the client? Updates will be to 64-bit WNs; no news about 32-bit releases. Nick: we can ask Oliver. CMS needs some client update, but no details were given. Daniele: the problem is how to use Crab with sites that have SRM v1 clients on their WNs against an SRM v2 installation. The problem is mainly with lcg-utils. Daniele will check whether a specific dCache client is needed.

Update from Antonio: patch includes dCache server and client 1.8

Miscellaneous News

Nick: GGUS:35694 from UKI concerning intermittent lcg-CE segmentation faults. Cause is known (VOMS extension expiring causes LCAS to bomb). Bug is 35981, & fix will be fast-tracked.

EGEE Items From ROC Reports

  • ROC France: Sec FP test – write access test during WN installation. ROC doesn’t think that publishing the test results is a good idea. [*update from Romain:* traceability policy mandates that the check remains. Contact OSCT to discuss. Results are published, but not the details (except to a chosen few in the security team)].
  • ROC SW: PIC scheduled downtime – would like to know who gets e-mail. How to ensure VO managers actually received the notification? CIC: Author receives confirmation, and one can look at broadcast retrieval link in CIC portal. Maria suggested looking at mailing list web archives...

Heads up from Cyril(?): CIC portal is not authorized to post to certain mailing lists. That needs fixing. Nick will follow up for CERN-hosted mailing lists; for others, contact the list owners.

WLCG Items

Started at 16:45.

WLCG issues coming from ROC reports

  • No issues

Upcoming WLCG Service Interventions

  • Interventions: (see list in agenda, read out by Nick). No other interventions declared during meeting.

CCRC Review (Harry)

CCRC started today and runs until Friday 31st May. When asked for an inspirational message from the sponsor, Jamie replied “Good Luck” :-)

eLog has burst back into life, entirely due to Roberto.

LHCb Service

LHCb (Roberto): All last week spent on testing all the steps before pushing the button. Moderately optimistic results. From notes taken with his organic word-processor (pencil), Roberto gave the following details:
  • PIT-T0 transfers: no problems
  • Tier0-Tier1: issue at CNAF with faulty disk server.
  • Reconstruction at Tier1s: issue at GridKa with empty s/w shared area. Issue accessing data at CNAF (see above). CPU time limit problem at NIKHEF (ceiling increased, but still too low). Number of events per reconstruction will be reduced to 25000.
  • Latest version of reconstruction application not present at some Tier1s (IN2P3, RAL, GridKa)
  • Tier1-Tier1: all channels showed signs of success, but some proxy certificate renewal issues at NIKHEF (GGUS tickets opened). FTS at RAL needs to find out about CNAF Castor servers (update to service.xml file needed). FTS at RAL is contacting NIKHEF using SRMv1 end-point. Transfers from GridKa to RAL were all timing out. GridKa has not set up LHCb DST space token in a path-independent manner. Some Tier1 FTS servers need to allow a higher number of concurrent active files.
  • After cleaning up February CCRC data at all sites, LHCb will quickly ramp-up to full-scale transfers.

ATLAS Service

  • Week 1: plan is to have functional test (distribution of fake data). Raw data will go to tape according to MoU. ESDs will go to a single Tier 1.
  • Week 2: one Tier 1 will send data to all other Tier 1s to test the full matrix.
  • Week 3: throughput tests. Day 1: 100% (of what?), 150% on the second day, 200% from the third day onwards (until stopped by ATLAS commissioning activities). Harry: This is 640 MB/s to tape.
  • Week 4: still undefined, but probably catching up on things that didn’t work in the previous weeks.

6 e-logs have been entered since this morning & emails sent to mailing lists:

  • NDGF LFC down
  • France & BNL have SRM end-point failures
  • NL tape 100% failure
  • Taipei – problematic disk, tape is healthy

Harry: end-points shouldn’t have changed since February, so why are they failing? Simone: something must have happened last week – problems have been marked as urgent. Gonzalo: Do you plan to track site problems through GGUS or eLog + e-mail? Simone: the latter two for sure, and GGUS if time permits. Maria suggested opening a GGUS ticket and putting the desired e-mail addresses in the CC field (but the lists have restricted access). Doing the reverse has the problem of replies opening new, undesired, GGUS tickets. Maybe mailing to GGUS in BCC is the answer. Simone will try to use GGUS if it doesn’t cause too much overhead.

CMS Service

Daniele: see updated agenda page for his CMS contribution. Harry: Tier1 tape operations in week 3 are worth emphasising.

ALICE Service

Harry received a brief report from ALICE, who are slowly ramping up: all VO boxes need to be migrated to SLC4, and xrootd at several Tier2s needs setting up.

Core service VOMS RS upgraded at CERN at 15:00, took 45 minutes. Harry then gave the floor to any Tier1s, but no one took up the offer.

Daily 15:00 CCRC meeting with dial-in is resuming (except Mondays). Maria advertised her meeting tomorrow at 16:00 about GGUS-to-site message routing, suggesting that Tier1s and VOs join.

OSG Items

Alas, no-one from OSG was on-line to witness Maria’s wrath about GGUS:33220, marked very urgent, and “discussed 6 times in this meeting, commented on 75 times”, which is ridiculous. Maria would like closure, with a comprehensive answer for the GGUS Knowledge Base. Steve suggested that the actual problem is not very important, so the ticket was wrongly opened as “very urgent” at the outset. Regardless of the excuses, Maria would like Rob to act on the ticket. Nick will speak to Rob when he returns from holiday, and suggested that there should be a backup contact point for OSG. Maria said that there was, and that all OSG contacts were in the loop.

Action Items

Newly Created Action Items

Assigned to Due date Description State Closed Notify  

Review of Open Action Items

Since the Action List had not been updated since the last meeting, this part was skipped – other than to note that for 136 & 137, GGUS tickets had not yet been opened against sites, which was the precondition for removing these items from our list. ATLAS should do this (open one ticket and get it cloned). The list will be followed up off-line.

Open Action Items

Id  Submitter  Description  Creation  Due  Assigned To

Actions Closed in Last 20 Days

Id  Submitter  Description  Creation  Due  Assigned To  Closed

No AOB, meeting ended at 17:15.

Summary

Next Meeting

The next meeting will be Monday, 19 May 2008 15:00 UTC (16:00 Swiss local time).

  • Attendees can join from 14:45 UTC (15:45 Swiss local time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss local time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss local time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r3 - 2008-07-10 - SteveTraylen
 