WLCG Tier1 Service Coordination Minutes - 15 July 2010

Data Management & Other Tier1 Service Issues

CERN
  Status: CASTOR 2.1.9-5 (all); SRM 2.9-3 (all)
  Recent changes: None
  Planned changes: CASTOR and the xroot plugin to be upgraded to 2.1.9-7 for all instances during the technical stop, plus a name server memory upgrade. The changes should be implemented online, without service interruption.

ASGC
  Status: CASTOR 2.1.7-19 (stager, name server); CASTOR 2.1.8-14 (tape server); SRM 2.8-2
  Recent changes: None
  Planned changes: None

BNL
  Status: dCache 1.9.4-3 (PNFS)
  Recent changes: None
  Planned changes: None

CNAF
  Status: SRM 2.8-5 (ALICE); StoRM 1.5.1-3 (ATLAS, CMS, LHCb, ALICE)

FNAL
  Status: dCache 1.9.5-10 (admin nodes) (PNFS); dCache 1.9.5-12 (pool nodes)
  Recent changes: None
  Planned changes: Will upgrade the PNFS server hardware during the technical stop

IN2P3
  Status: dCache 1.9.5-11 (Chimera)

KIT
  Status: dCache 1.9.5-15 (admin nodes) (Chimera); dCache 1.9.5-5 to 1.9.5-15 (pool nodes)

NDGF
  Status: dCache 1.9.7 (head nodes) (Chimera); dCache 1.9.5, 1.9.6 (pool nodes)

NL-T1
  Status: dCache 1.9.5-19 (Chimera) (SARA); DPM 1.7.3 (NIKHEF)

PIC
  Status: dCache 1.9.5-20rc1 (PNFS)
  Planned changes: 20/7: scheduled downtime from 06:00 to 18:00 for OS and firmware upgrades to storage, computing and Oracle (3D, FTS, LFC) services; FTS queues will be drained; dCache will be upgraded to 1.9.5-21

RAL
  Status: CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (name server, central node); CASTOR 2.1.8-17 (name server, local node on the SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2
  Recent changes: None
  Planned changes: Plans are in place to upgrade to 2.1.9 later this year. Stress and functional testing of 2.1.9 on the test systems is complete; end-user testing will be included from August. Information about the upgrade is available here.

TRIUMF
  Status: dCache 1.9.5-17 with the Chimera namespace
  Recent changes: None
  Planned changes: New DB infrastructure deployed to host the TAGS and FTS databases; a two-hour FTS downtime on 19 July to move to the new instance


SRM 2.9-4 is now officially available in the Savannah release area. Full release notes and upgrade instructions are available.

dCache news

Nothing to report.

StoRM news

Version 1.5.3 has just been released and is available for installation. The release is currently available for SL4 only; version 1.5.4, expected in the first week of August, will also be released for SL5. These are the highlights of version 1.5.3:
  • upgraded DB schema script (from version 1.5.0 to 1.5.3)
  • updated startup/status script
  • fixed the output of the StoRM configuration on startup
  • updated the welcome message
  • fixed the use of getent to retrieve the local group information
  • added a check for the existence of the local group used in path-authz and default-acl
  • better handling of the LCMAPS log files
  • fixed the urgent bug on the correct use of default values set in the configuration file (aka the "volatile" bug)
  • updated the use of the concurrent package in the scheduler component (removed the backport of java.util.concurrent)
  • fixes to the garbage collection of expired requests (modified the retrieval of the number of expired requests)
  • storm-backend-server-1.5.3-4.sl4: startup script moved from /etc/init.d to /opt/storm/backend/etc/init.d; logrotate cron file moved from /etc/cron.d to /opt/storm/backend/etc/logrotate.d
  • ig-yaiim-storm: StoRM variables moved from <site-info.def> to services/ig_se_storm*; added a symbolic link from /opt/storm/backend/etc/init.d/storm-backend to /etc/init.d/storm-backend; added a symbolic link from /opt/storm/backend/etc/logrotate.d/storm-backend.cron to /etc/cron.d/storm-backend.cron
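The relocations in the last two items amount to moving the startup and logrotate files under /opt/storm/backend and linking them back into /etc. A minimal sketch of that layout, using a scratch directory instead of the real filesystem so nothing outside it is touched:

```shell
#!/bin/sh
# Sketch of the StoRM 1.5.3 packaging layout, recreated under a
# scratch root rather than the real /etc and /opt.
ROOT=$(mktemp -d)

# New canonical locations under /opt/storm/backend
mkdir -p "$ROOT/opt/storm/backend/etc/init.d" \
         "$ROOT/opt/storm/backend/etc/logrotate.d" \
         "$ROOT/etc/init.d" "$ROOT/etc/cron.d"

# Startup script and logrotate cron file now live under /opt/storm/backend
touch "$ROOT/opt/storm/backend/etc/init.d/storm-backend"
touch "$ROOT/opt/storm/backend/etc/logrotate.d/storm-backend.cron"

# Compatibility symlinks back into /etc, as added by ig-yaiim-storm
ln -s "$ROOT/opt/storm/backend/etc/init.d/storm-backend" \
      "$ROOT/etc/init.d/storm-backend"
ln -s "$ROOT/opt/storm/backend/etc/logrotate.d/storm-backend.cron" \
      "$ROOT/etc/cron.d/storm-backend.cron"

ls -l "$ROOT/etc/init.d/storm-backend" "$ROOT/etc/cron.d/storm-backend.cron"
```

With the links in place, `service storm-backend start` and the logrotate cron entry keep working from their traditional /etc locations while the files themselves are owned by the /opt/storm tree.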

Known issues: version 1.5.3 does not support tape; tape support will arrive with version 1.5.4.

DPM news

See comments about LFC.

LFC news

LFC 1.7.4-6 is now the recommended version for SLC5. For SLC4 it is still 1.7.3 due to an issue with the VOMS library.

LFC 1.7.4-7 is in staged rollout, but the only changes are some fixes to the Python 2.5 interface.

FTS news

FTS 2.2.5 (adding support for sites without SRM, and for .lsc files) will enter certification next week.
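For context, an .lsc file replaces a statically distributed VOMS server certificate: it is a small text file, conventionally placed at /etc/grid-security/vomsdir/<vo>/<voms-host>.lsc, listing the subject DN of the VO's VOMS server followed by the DN of its issuing CA. A sketch of the format (the DNs below are illustrative, not real):

```text
/DC=org/DC=example/OU=computers/CN=voms.example.org
/DC=org/DC=example/CN=Example Certification Authority
```

At proxy-validation time the service checks the VOMS attribute signature against this subject/issuer pair instead of a pinned host certificate, so routine VOMS certificate renewals no longer require redistributing files.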

Database services

  • Experiment reports:
    • ALICE:
      • Planned shut-down of ALIONR cluster on Monday 19th for storage array reboot
    • ATLAS:
      • A new standby database for the ATLR cluster is being installed, as the old standby DB was showing problems that may be hardware-related
      • A new version of the PVSS Streams apply handler was deployed on 12 July for the ATLAS online-to-offline replication. The change was developed and tested together with ATLAS in order to improve the performance of the DCS client tools.
    • CMS:
      • Problems with the CMS trigger online application; follow-up in progress
      • Some problems with online -> offline conditions replication on 13 July: a single crash, restarted automatically during the night
      • User errors caused PVSS replication to fail around 18:30 on 13 July; the transactions were skipped and the users notified
      • A replacement plan for the hardware hosting the CMS databases deployed at P5 has been agreed with the CMS database coordinators; it will be implemented this autumn.
    • LHCb:
      • NTR

  • Site reports:
PIC
  Status, recent changes, incidents: Last week: rolled back the PSU patch on the ATLAS, LHCb and LFC databases; auditing was turned on for the ATLAS and LHCb DBs. During the scheduled downtime, LAN problems on an FTS database server will be corrected and the firmware of all blades hosting Oracle upgraded.
  Planned interventions: 06:00 to 18:00 on 20 July: series of interventions; firmware and OS upgrades affecting storage, computing and Oracle (3D, FTS, LFC) services.

IN2P3
  Status, recent changes, incidents: Nothing to report.
  Planned interventions: None.

NDGF
  Status, recent changes, incidents: Nothing to report.
  Planned interventions: None.

RAL
  Status, recent changes, incidents: Nothing to report.
  Planned interventions: Multipath configuration changes on the following timetable:
    * Tuesday 20th, 11:00-15:00: At Risk on OGMA (ATLAS).
    * Wednesday 21st, 10:00-14:00: At Risk on LUGH (LHCb).
    * Thursday 22nd, 10:00-14:00: At Risk on SOMNUS (LFC/FTS).

SARA
  Status, recent changes, incidents: Nothing to report.
  Planned interventions: None.

TRIUMF
  Status, recent changes, incidents: Nothing to report.
  Planned interventions: Monday 19 July: move of the FTS Oracle RAC to new servers. In addition, interventions are planned for the ATLAS 3D Oracle RAC (OS upgrade to RH Linux 5 and storage upgrades), with a short downtime on Tuesday 20 July.

BNL
  Status, recent changes, incidents: Nothing to report; waiting for the July PSU and evaluating a possible rollback of the April PSU followed by application of the July PSU.

KIT
  Status, recent changes, incidents: Saturday 10.07.2010: an air-conditioning failure at GridKa took down part of the infrastructure, including the 3D Oracle RACs, which were down for approximately 4 hours. Notification of broken streams on the 3D databases arrived at 20:26 CET; after DBA intervention, the LHCb and LFC/FTS databases were back online at 22:36 CET. Due to a SAN failure, the ATLAS DB remained offline until 00:13 the following day (11.07.2010); since then all 3D databases at KIT-T1 have been fully online.

ASGC
  Status, recent changes, incidents: SRM DB high-load issue again last week: some ora_jxxx jobs occupied large amounts of memory without releasing it for a long time; under investigation. Still working on the Oracle RAC testbed verification.

-- JamieShiers - 13-Jul-2010

Topic revision: r14 - 2010-07-15 - unknown