WLCG-OSG-EGEE Ops' Minutes Tue 18 Mar 2008

Attendance

EGEE

  • Asia Pacific ROC: Min Tsai
  • Central Europe ROC: Marcin Radecki
  • OCC / CERN ROC: John Shade, Antonio Retico, Nick Thackray, Steve Traylen, Maite Barroso
  • French ROC: Absent?
  • German/Swiss ROC: Clemens Koerdt, Sven Hermann
  • Italian ROC: Alessandro Cavalli
  • Northern Europe ROC: Jules Wolfrat
  • Russian ROC: Lev, Victor Edneral
  • South East Europe ROC: Kostas Koumantaros
  • South West Europe ROC: Kai Neuffer, Gonzalo Merino
  • UK/Ireland ROC: Jeremy Coles
  • GGUS: Absent
  • OSCT: Absent

WLCG

  • WLCG Service Coordination: Harry Renshall, Jamie Shiers
  • ATLAS: Alessandro di Girolamo
  • LHCb: Roberto Santinelli

WLCG Tier 1 Sites

  • ASGC: Min Tsai
  • BNL: Absent
  • CERN site: Ignacio Reguero
  • FNAL: ?
  • FZK: Sven Hermann
  • IN2P3: Absent?
  • INFN: Daniele Bonacorsi
  • NDGF: Leif
  • PIC: Kai Neuffer
  • RAL: Derek Ross, Matt Hodges
  • SARA/NIKHEF: Absent
  • TRIUMF: Absent

Please, when logging in to the audio-conference web interface, specify your name and affiliation. This makes constructing the list of attendees a lot easier. What are the affiliations of Catalin? Emanouil Atanassov?

Reports Not Received

  • VOs: Alice, BioMed, LHCb
  • EGEE ROCs (Prod Sites): All reports received!

Meeting started at 16:04 (4 minutes late).

Jeremy Coles informed the meeting that no PPS report was received from UKI because there was no edit option for the report. A GGUS ticket has been raised for the CIC portal developers to investigate.

Feedback on Last Week's Minutes

No feedback.

EGEE Items

Grid Operator Hand Over on Duty

        Primary Team   Secondary Team
From    ROC Russia     ROC DECH
To      ROC AP         ROC SEE

  • SRM problems at YerPhi - site responds, so can't be suspended. Better solution needed.
  • An SLA is being put in place at the moment. TAU-LCG2 (SE ROC) is doing the strict minimum to avoid suspension. Nick suggests suspending the site (an OCC decision) because it uses too much support time. Kostas will give the site a last chance & will suspend it if necessary. Action: decision within two weeks.
  • Point 2. Lev explained that the site ru-Chernogolovka-IPCP-LCG2 was certified in error for half a day (hence the alarms).
  • Point 3 (rgma node in GOCDB). Antonio explained that IC (Information Catalogue) needs to be used. GGUS:33927
  • Following recent SAM outage, SE ROC suggests monitoring for monitoring. John replied that this is a priority for the SAM team.
  • Long thread in LCG rollout about the SAM csh test failing on HEPiX nodes. Maite asked whether anyone had had the sense to submit GGUS tickets. Update: This is due to a HEPiX bug, but a new SAM sensor with a workaround will be released shortly.

PPS Reports

  • No major issues. See meeting agenda for detailed release news.

EGEE Items From ROC Reports

  • Handling outages of SAM & other services not under a site's control - from Marcin, CE ROC. John explained that future SAM architecture will rely on buffering, and that SAM service reliability is top of the SAM team's agenda. For their availability reports, WLCG have decided to treat "no-data" periods as site being available (after the 24hrs timeout of the last valid SAM status). We could make the same assumption. Added complication is dependent services outside a site. John to continue the discussion with WLCG, GridView, Marcin et al.
  • Weekend SAM outage was due to a combination of an RPM bug and human oversight. SAM team will implement local fabric monitoring to detect outages automatically in the future.
  • Lev (Russia): anonymous r/w access (with no grid certificate) is granted on DPM nodes with the standard setup. This violates the traceability & logging policy. The DPM & dCache developers have been mailed and have promised authentication "really soon", but "really soon" means two months in dCache land. Nick suggested passing this on to the OSCT, since it is a vulnerability in something in production.
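The WLCG "no-data" rule mentioned in the first item above can be illustrated with a minimal sketch. It assumes that the last valid SAM status stays in force for 24 hours and that anything later counts as available. The function name, the status strings, the handling of a site with no results at all, and the dates are hypothetical and are not taken from SAM or GridView.

```python
from datetime import datetime, timedelta

# Validity period of the last valid SAM status, as quoted in the minutes.
STATUS_TIMEOUT = timedelta(hours=24)


def effective_status(last_result_time, last_status, now):
    """Return the status to use in an availability report at time `now`.

    Assumption: once the last valid result is older than 24 hours,
    the period is "no-data" and the site is counted as available.
    A site with no result at all (None) is treated the same way.
    """
    if last_result_time is None or now - last_result_time > STATUS_TIMEOUT:
        return "available"  # no-data period
    return last_status


# Hypothetical example: the last valid test result said "unavailable".
last = datetime(2008, 3, 15, 12, 0)
print(effective_status(last, "unavailable", datetime(2008, 3, 15, 18, 0)))  # unavailable
print(effective_status(last, "unavailable", datetime(2008, 3, 16, 13, 0)))  # available
```

Under this assumption, a site that was failing when the monitoring stopped keeps its failing status for at most 24 hours. Dependent services outside the site are not covered by the sketch.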

gLite Release News

Antonio: not much to add to what was on the agenda page. Catalin from Fermilab(?): Before releasing certificates, can you check that they're valid? After an update, we don't want RPMs with invalid certificates! Antonio: This is a bug, please open a ticket. NB: distribution of certificates is not covered by the release team.

WLCG Items

WLCG issues coming from ROC reports

  • None

Upcoming WLCG Service Interventions

  • In addition to those listed on the agenda page, Min gave notice of ASGC downtime Wednesday 18/3/08 23:30 - 19/3/08 10:00 for Castor2 and Oracle RAC h/w migration.
  • 7-11 April: CERN to T1 backup links will be tested; there may be some instability.
  • One PIC tape robot will be unavailable for about 1 month. Gonzalo: upgrading to new robot (company hasn't provided dates), but service will not stop. Anything on old robot will need to be moved to new one. No problem for new data being written.

WLCG Service Coordination and CCRC08 review

Harry: nothing in e-log since Thursday, which is good.

ATLAS Service

Alessandro:
  • continuing with Tier-1 to Tier-1 functional tests, and pushing sites to upgrade to SL4, since time is running out.
  • Downtimes (also discussed with other experiments). Asking for a web interface or CLI to retrieve foreseen downtimes.
Antonio said that the information was in SAM, but it was pointed out that SAM doesn't look into the future. Steve requested the list of questions Alessandro wants answered, since "the information is there" & should be easy to implement. Alessandro will formalize what he & other experiments want.
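As an illustration of the kind of query being asked for, here is a minimal sketch that selects foreseen downtimes from a list of records. The record layout, the site names and the function name are hypothetical. It does not reflect any existing SAM, GOCDB or CIC portal interface.

```python
from datetime import datetime


def foreseen_downtimes(downtimes, now):
    """Return the downtimes that have not ended yet, soonest first."""
    upcoming = [d for d in downtimes if d["end"] > now]
    return sorted(upcoming, key=lambda d: d["start"])


# Hypothetical records; a real tool would read these from a downtime database.
DOWNTIMES = [
    {"site": "SITE-A", "start": datetime(2008, 3, 10, 8, 0), "end": datetime(2008, 3, 10, 12, 0)},
    {"site": "SITE-B", "start": datetime(2008, 4, 7, 0, 0), "end": datetime(2008, 4, 11, 0, 0)},
    {"site": "SITE-C", "start": datetime(2008, 3, 19, 23, 30), "end": datetime(2008, 3, 20, 10, 0)},
]

now = datetime(2008, 3, 18, 16, 0)
for d in foreseen_downtimes(DOWNTIMES, now):
    print(d["site"], d["start"], d["end"])
```

The sketch filters on the end time rather than the start time, so a downtime already in progress is still reported. Whether the experiments want that behaviour is one of the questions Alessandro's list would need to settle.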

ALICE Service

CMS Service

Daniele Bonacorsi gave a very detailed report, the contents of which can be found on the meeting's agenda page. There's been a lot of CMS activity, and only a few problems:
  • an increasing number of queued requests being investigated by CASTOR team
  • missing ProdAgent functionality being worked on by DM/WM developers
  • Tier 0 processing: dataset naming policy needs improvement
Daniele also announced that there would be a regular review of CMS-specific SAM tests.

LHCb Service

Roberto Santinelli mentioned that there hadn't been much activity due to software week. Some work was done with INFN to test TSM s/w, and 8TB were successfully transferred to CNAF prior to their scheduled downtime (next week and the one after).

OSG Items

No one on line.

Action Items

Newly Created Action Items

Assigned to Due date Description State Closed Notify  

Review of Open Action Items

Open Action Items

Id Submitter Description Creation Due Assigned To

Actions Closed in Last 20 Days

Id Submitter Description Creation Due Assigned To Closed

Next Meeting

The next meeting will be in two weeks due to the Easter holidays: Monday, 31 MAR 2008 15:00 UTC (16:00 Swiss time).

  • Attendees can join from 14:45 UTC (15:45 Swiss time) onwards.
  • The meeting will start promptly at 15:00 UTC (16:00 Swiss time).
  • The WLCG section will start at the fixed time of 15:30 UTC (16:30 Swiss time).
  • To dial in to the conference:
    • Dial +41227676000
    • Enter access code 0157610


Topic revision: r4 - 2008-03-21 - JohnShade
 