WLCG Tier1 Service Coordination Minutes - 15 July 2010

  • Present: Edoardo, Patricia, Nicolo, Manuel, Tim, Harry, Jamie, Maria, Andrea, Maria A, Dawid, Pepe, Helge, Lola, MariaDZ, Angela, Rolf, Tiju, Rob, CNAF, Mattias/NDGF, Jon, Frederique, Federico/LHCb, Vera, Ron, Carlos, TRIUMF-Andrew, Felix Lee, Jhen-Wei, Julia, IanF

Interventions Planned during LHC Technical stop

  • NDGF - may have followup intervention following this week's SE problems
  • Fermilab - PNFS work postponed until next TS
  • CERN - CASTOR transparent upgrades scheduled
  • TRIUMF - intervention on FTS DB
  • PIC & RAL - see agenda page and below.
    • PIC - I will not be able to connect to the next T1SCM, so this is just a reminder that at PIC we are planning a Scheduled Intervention next Tuesday, 20th July, during the LHC technical stop days. Several interventions are planned. Most of them are firmware and OS upgrades affecting the Storage, Computing and Oracle (3D, FTS, LFC) services. The whole site will therefore be declared in SD from 6 am until 6 pm on that day. Batch queues and FTS channels will be drained accordingly - Gonzalo
    • RAL Monday - Thursday 19-22 July. Site at Risk for transformer work (TBC)
      • Monday 19th July (08:00-14:00 UTC) - Outage on tape system for swap to spare controller.
      • Tuesday 20th July (07:00-13:00 UTC) - Outage on tape system for microcode update on tape robots.
        The transformer work is looking increasingly unlikely but remains scheduled for now.
        Not in the GOC DB yet - but proposed:
      • Tuesday 20th July. At Risk on ATLAS 3D (ogma) for SAN multipath configuration update.
      • Wednesday 21st July. At Risk on LHCb 3D/FTS (lugh) for SAN multipath configuration update.
      • Thursday 22nd July. At Risk on LFC/FTS (somnus) for SAN multipath configuration update.

GGUS ticket follow up (see slides)

  • ATLAS and CMS agreed that their problems were very complex, so the time it took (or is still taking) to solve them is understandable.
  • LHCb will be very happy to get the detailed action list into its ticket against the Tier0. The other 2 LHCb cases concern Tier2s, who are not represented in this meeting.
  • In general, all supporters were reminded to use the GGUS ticket fields instead of email threads, in order to have the story complete when we perform analysis of the cases.
  • CERN Services felt it would be useful to import the 'Priority' value from GGUS, even if ticket submitters sometimes exaggerate. Action on Maria. Progress will be recorded in https://savannah.cern.ch/support/?115704.

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.9-5 (All); SRM 2.9-3 (all) | None | CASTOR and xroot plugin to be upgraded to 2.1.9-7 for all instances during technical stop. Name server memory upgrade. Changes should be implemented online without service interruption. |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | None | None |
| BNL | dCache 1.9.4-3 (PNFS) | None | None |
| CNAF | CASTOR 2.1.7-27 (ALICE); SRM 2.8-5 (ALICE); StoRM 1.5.1-3 (ATLAS, CMS, LHCb, ALICE) | | |
| FNAL | dCache 1.9.5-10 (admin nodes) (PNFS); dCache 1.9.5-12 (pool nodes) | None | Will upgrade PNFS server hardware during technical stop |
| IN2P3 | dCache 1.9.5-11 (Chimera) | | |
| KIT | dCache 1.9.5-15 (admin nodes) (Chimera); dCache 1.9.5-5 - 1.9.5-15 (pool nodes) | | |
| NDGF | dCache 1.9.7 (head nodes) (Chimera); dCache 1.9.5, 1.9.6 (pool nodes) | | |
| NL-T1 | dCache 1.9.5-19 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-20rc1 (PNFS) | | 20/7: scheduled downtime from 0600 to 1800 for OS and firmware upgrades to storage, computing and Oracle (3D, FTS, LFC) services; FTS queues will be drained; dCache will be upgraded to 1.9.5-21 |
| RAL | CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (nameserver central node); CASTOR 2.1.8-17 (nameserver local node on SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2 | None | Plans are in place for upgrading to 2.1.9 later this year. We have completed stress and functional testing of 2.1.9 on our test systems, and will include end user testing from August. Information about the upgrade is available here |
| TRIUMF | dCache 1.9.5-17 with Chimera namespace | None | Storage firmware upgrade during technical stop. New DB infrastructure deployed to host TAGS + FTS DB; 2 hours FTS DT on July 19 to move to the new instance |

CASTOR news

SRM 2.9-4 is now officially available in the savannah release area. Full release notes and upgrade instructions are available.

dCache news

Nothing to report.

StoRM news

Version 1.5.3 has just been released and is available for installation. The release is currently available for SL4 only, but version 1.5.4 should also be released for SL5 during the first week of August.

Known issues: 1.5.3 does not support tape. Version 1.5.4 will support tape.

DPM news

See comments about LFC.

LFC news

LFC 1.7.4-6 is now the recommended version for SLC5. For SLC4 it is still 1.7.3 due to an issue with the VOMS library.

LFC 1.7.4-7 is in staged rollout but the only difference is some fixes for the Python 2.5 interface.

FTS news

FTS 2.2.5 (supporting sites without SRM and .lsc files) will enter certification next week.

Database services

  • Experiment reports:
    • ALICE:
      • Planned shut-down of ALIONR cluster on Monday 19th for storage array reboot
    • ATLAS:
      • A new standby database for the ATLR cluster is being installed at the moment, because the old standby DB was showing problems that may be hardware related.
      • A new version of the PVSS streams apply handler was deployed on 12th of July for ATLAS online to offline replication. This change was developed and tested together with ATLAS in order to improve the performance of DCS client tools.
    • CMS:
      • Problems with CMS trigger online application - follow-up in progress
      • Some problems with online -> offline conditions replication on 13th of July - single crash, restarted automatically during night
      • User errors caused PVSS replication to fail around 18:30 on 13th of July - transactions skipped, users notified
      • A replacement plan for hardware hosting CMS databases deployed at P5 has been agreed with CMS database coordinators. The plan will be implemented this autumn.
    • LHCb:
      • NTR

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | SRM DB high load issue again last week: some ora_jxxx jobs occupied a lot of memory without releasing it for a long time; under investigation. Still working on our Oracle RAC testbed verification. | None |
| BNL | Nothing to report; waiting for PSU July and evaluating the possibility of rolling back PSU April and applying PSU July. | |
| CNAF | | |
| KIT | Saturday 10.07.2010 - air conditioning failure at GridKa; part of the infrastructure went down, including the 3D Oracle RACs. As a consequence, the ATLAS, LHCb and LFC/FTS RACs were down for approximately 4 hours. We were informed about the broken streams on the 3D databases at 8:26 PM (CET). After DBA intervention, the LHCb and LFC/FTS databases were back online at 10:36 PM (CET). Due to a SAN failure, the ATLAS DB remained offline until 0:13 AM the next day (11.07.2010). Since 0:13 AM all 3D databases at KIT-T1 have been 100% online. | None |
| IN2P3 | Nothing to report | None |
| NDGF | Nothing to report | None |
| PIC | Last week - rolled back PSU patch on ATLAS, LHC and LFC databases. Audit was also turned on for ATLAS and LHC DBs. | During the Scheduled Downtime intervention, we are going to correct LAN problems in an FTS database server and upgrade the firmware of all the blades hosting Oracle. 6 a.m. - 6 p.m. on 20th of July - series of interventions - firmware and OS upgrades affecting storage, computing and Oracle (3D, FTS, LFC) services. |
| RAL | Nothing to report | Multipath configuration changes according to this timetable: Tuesday 20th 11:00 - 15:00 At Risk on OGMA (ATLAS); Wednesday 21st 10:00 - 14:00 At Risk on LUGH (LHCb); Thursday 22nd 10:00 - 14:00 At Risk on SOMNUS (LFC/FTS). |
| SARA | Nothing to report | No interventions |
| TRIUMF | Nothing to report | Monday July 19th - move of FTS Oracle RAC to new servers. In addition, interventions are planned for the ATLAS 3D Oracle RAC: Linux OS upgrade to RH Linux 5 and storage upgrades - short downtime on Tuesday 20th of July |

-- JamieShiers - 13-Jul-2010

Topic revision: r17 - 2010-07-15 - MariaDimou
 